Thu's Project Guide

Project: OPT Pal — H-1B Employer Data Hub Category: Web App (Flask) + Data Science Last updated: April 21

Note: This guide reflects the latest state of your project repo. It may not match the most up-to-date version if you've worked since.

Where You Are

Big jump from last week — real scaffolding is in the repo now:

helpers.py loads the USCIS CSV, cleans it (renames the long Excel headers to short codes, drops blank employers, coerces counts to numbers), and writes it into SQLite via import_data_into_SQL.py. That's legit data-engineering work.
app.py has one route that reads ?industry= and calls search_employers_by_industry, which runs a parameterized LIKE query against employers in SQLite.
templates/home.html renders a table of results with a styled search form.
Your project.spec.md is written and already names charts + compare + multi-year as stretch goals.

What's missing for the pivot we talked about: the app today is a single text filter. The plan now is to reshape it into a short onboarding quiz that narrows the search, then a dashboard with charts and (eventually) a US map. That's a much better demo-day story for "OPT Pal."

Two small things to fix in passing:

search_employers_by_industry currently lives in helpers.py next to cleaning. Search is business logic — it belongs in search.py on its own.
project.journal.md is still the template. Checkpoint 1 + Checkpoint 2 entries need to go in by Thursday.

Project Structure

Your project splits into two kinds of code:

Business logic — you handwrite this. Everything in search.py (NEW): the filter function the quiz drives, and the aggregate functions the dashboard reads. These ARE your app — the decisions about how someone narrows 12k employers and what numbers they see when they land on the dashboard.
Library / view code — agent-assisted is fine. Flask routes in app.py (they just read query params and pass dicts to templates), Jinja templates, Chart.js / Plotly.js snippets, CSS.
Keep as-is → helpers.py stays as your cleaning module (load_raw, clean_employers, load_clean). No search logic in there.

Target layout by Thursday:

final-project-huynth4/
├── app.py                          ← Flask routes — agent-assisted OK
├── helpers.py                      ← cleaning only — already done
├── search.py                       ← business logic — handwrite (NEW, yours to own)
├── import_data_into_SQL.py         ← already done
├── pyproject.toml
├── templates/
│   ├── base.html
│   ├── quiz.html                   ← NEW — the onboarding wizard
│   └── dashboard.html              ← NEW — charts + stats + results
├── static/
│   └── style.css
├── instance/
│   └── opt_pal.db
└── data/
    └── Employer Information.csv

Quiz flow

To setup your quiz, we'll create a sequence of URL routes that will guide the user through a series of questions leading to a dashboard with helpful information.

Home "/"  →  /quiz?step=1  →  /quiz?step=2  →  /quiz?step=3  →  /quiz?step=4  →  /dashboard
             industry         state            city (opt)       company size     charts + stats + results table

Each quiz step passes the prior answers forward as hidden inputs in the form, so state lives in the URL — no Flask session needed.

Phase 1: Split cleaning vs search — create `search.py`

Handwrite this yourself.

Objective

Move search_employers_by_industry out of helpers.py into a new search.py. helpers.py becomes cleaning-only.

Instructions

Create search.py at the project root
Move the DB_PATH constant and the search_employers_by_industry function from helpers.py into search.py
Delete them from helpers.py
Update the import at the top of app.py: from search import search_employers_by_industry
Run uv run flask --app app run --debug and confirm nothing broke

Hints

Top of search.py:

"""Search + aggregate functions for OPT Pal.

All SQLite queries live here. No Flask imports.
"""

from pathlib import Path
import sqlite3

DB_PATH = Path(__file__).resolve().parent / "instance" / "opt_pal.db"


def _connect():
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    return conn

The _connect() helper means every function in search.py opens the DB the same way. The leading underscore is a Python convention for "private to this module."

Optional — get help from your agent:

Walk me through the difference between helpers.py (cleaning) and
search.py (queries). Why separate them? Don't change my code —
just explain the reasoning.

Phase 2: Generalize the filter — `search_employers(filters)`

Write most of this yourself.

Objective

Replace search_employers_by_industry(user_input) with search_employers(filters) that accepts any combination of industry, state, city, and size and builds one parameterized SQL query.

Instructions

In search.py, write search_employers(filters, limit=50) that returns a list of dicts
Support these keys in filters: industry (substring), state (exact, uppercased), city (substring, uppercased), size ("small", "mid", "large", or "any")
Build the WHERE clause by appending to a list of clauses and a list of params, then joining with AND
Add unique_industries() and unique_states() helpers for the quiz dropdowns
Delete the old search_employers_by_industry once search_employers works
Update app.py to call the new function

Hints

search.py — fill in the blanks yourself:

APPROVAL_COLS = [
    "new_employment_approval",
    "continuation_approval",
    "change_employer_approval",
    "amended_approval",
]
TOTAL_APPROVALS_SQL = " + ".join(APPROVAL_COLS)

SIZE_RANGES = {
    "small": (0, 50),
    "mid":   (50, 500),
    "large": (500, 10_000_000),
}


def search_employers(filters, limit=50):
    clauses, params = [], []

    if filters.get("industry"):
        clauses.append("industry LIKE ?")
        params.append(f"%{filters['industry'].strip()}%")

    if filters.get("state"):
        clauses.append("state = ?")
        params.append(filters["state"].strip().upper())

    if filters.get("city"):
        clauses.append("city LIKE ?")
        params.append(f"%{filters['city'].strip().upper()}%")

    size = (filters.get("size") or "").lower()
    if size in SIZE_RANGES:
        lo, hi = SIZE_RANGES[size]
        clauses.append(f"({TOTAL_APPROVALS_SQL}) BETWEEN ? AND ?")
        params.extend([lo, hi])

    where = " AND ".join(clauses) if clauses else "1=1"

    query = f"""
        SELECT employer_name, industry, city, state, fiscal_year,
               ({TOTAL_APPROVALS_SQL}) AS total_approvals
          FROM employers
         WHERE {where}
         ORDER BY total_approvals DESC, employer_name ASC
         LIMIT ?
    """
    params.append(limit)

    with _connect() as conn:
        rows = conn.execute(query, params).fetchall()
    return [dict(r) for r in rows]


def unique_industries():
    with _connect() as conn:
        rows = conn.execute(
            "SELECT DISTINCT industry FROM employers "
            "WHERE industry != '' ORDER BY industry"
        ).fetchall()
    return [r["industry"] for r in rows]


def unique_states():
    with _connect() as conn:
        rows = conn.execute(
            "SELECT DISTINCT state FROM employers "
            "WHERE state != '' ORDER BY state"
        ).fetchall()
    return [r["state"] for r in rows]

Why build the query as a string + params list? Each filter is optional. You can't know the shape of the WHERE clause in advance. Appending to clauses / params together keeps the ? placeholders and the values in sync — that's what keeps the query safe from SQL injection.

Why TOTAL_APPROVALS_SQL as a constant? The four approval columns show up in both the filter (for size) and the SELECT (for sorting). Defining the sum once means you can't forget to update one when you change the other.

Optional — get help from your agent:

Here is my search_employers function. Explain how the clauses and
params lists work together to build a safe parameterized query.
What would go wrong if I used f-strings to inject user input
directly into the SQL? Don't change my code.

Phase 3: Aggregates for the dashboard

Handwrite this yourself. These three functions are what the dashboard charts read. Same filter shape as Phase 2 so they compose.

Objective

Add totals(filters), top_employers(filters, n=10), and approvals_by_state(filters) to search.py. All three take the same filters dict as search_employers.

Instructions

Write totals(filters) → returns {"employers": int, "approvals": int} for the filtered set
Write top_employers(filters, n=10) → returns top N employer rows by total approvals
Write approvals_by_state(filters) → returns [{"state": "CA", "total_approvals": 12345}, ...] sorted descending
Factor the "build WHERE clause from filters" logic into a small helper so the three aggregate functions don't duplicate it

Hints

Extract the WHERE-builder once so you aren't repeating the Phase 2 logic four times:

def _build_where(filters):
    """Return (where_sql, params) for the given filter dict."""
    clauses, params = [], []
    # ... exactly the same logic from search_employers, but return it ...
    where = " AND ".join(clauses) if clauses else "1=1"
    return where, params

Then search_employers, totals, top_employers, and approvals_by_state all call _build_where(filters) at the top.

totals:

def totals(filters):
    where, params = _build_where(filters)
    query = f"""
        SELECT COUNT(*) AS employers,
               COALESCE(SUM({TOTAL_APPROVALS_SQL}), 0) AS approvals
          FROM employers
         WHERE {where}
    """
    with _connect() as conn:
        row = conn.execute(query, params).fetchone()
    return {"employers": row["employers"], "approvals": int(row["approvals"])}

top_employers and approvals_by_state follow the same pattern — SELECT ... WHERE {where} GROUP BY ... ORDER BY ... LIMIT ?. You write those.

Optional — get help from your agent:

Walk me through why I pulled the WHERE-building logic into
_build_where() instead of copy-pasting it. What does this buy me
when I add a 5th filter later? Don't change my code.

Phase 4: Build the quiz flow

Objective

A 4-step onboarding wizard at /quiz?step=N. Each step asks one question and carries the prior answers forward via hidden inputs. Step 4's submit goes to /dashboard.

The 4 questions

Industry — dropdown populated from unique_industries()
State — dropdown from unique_states(), with an "Any state" option
City — optional text input
Company size — radio buttons: Small (< 50 approvals), Mid (50–500), Large (500+), Any

Instructions

Add a /quiz route in app.py that reads step from request.args (default to 1)
Read any prior answers (industry, state, city, size) from request.args too
Render quiz.html, passing the step number, the prior answers, and the dropdown options
In quiz.html, use {% if step == 1 %} ... {% elif step == 2 %} ... to show the right question
Every form in quiz.html has a method="get" pointing at /quiz for steps 1–3 and /dashboard for step 4, plus a hidden <input> for each prior answer and for the next step number
Update the home route / to redirect to /quiz?step=1 (or just rename / to be the quiz)

Hints

app.py — quiz route:

from flask import Flask, render_template, request, redirect, url_for
from search import unique_industries, unique_states

app = Flask(__name__)


@app.route("/")
def home():
    return redirect(url_for("quiz", step=1))


@app.route("/quiz")
def quiz():
    step = int(request.args.get("step", 1))
    answers = {
        "industry": request.args.get("industry", ""),
        "state":    request.args.get("state", ""),
        "city":     request.args.get("city", ""),
        "size":     request.args.get("size", ""),
    }
    return render_template(
        "quiz.html",
        title="OPT Pal",
        step=step,
        answers=answers,
        industries=unique_industries() if step == 1 else [],
        states=unique_states() if step == 2 else [],
    )

templates/quiz.html — the passthrough pattern (this is the key trick):

{% extends "base.html" %} {% block content %}
<section class="panel">
  <h1>Let's narrow down your search</h1>
  <p class="meta">Step {{ step }} of 4</p>

  {% if step == 1 %}
  <form method="get" action="{{ url_for('quiz') }}">
    <input type="hidden" name="step" value="2" />
    <label for="industry">What industry are you targeting?</label>
    <select name="industry" id="industry">
      {% for opt in industries %}
      <option
        value="{{ opt }}"
        {%
        if
        opt=""
        ="answers.industry"
        %}selected{%
        endif
        %}
      >
        {{ opt }}
      </option>
      {% endfor %}
    </select>
    <button type="submit">Next</button>
  </form>

  {% elif step == 2 %}
  <form method="get" action="{{ url_for('quiz') }}">
    <input type="hidden" name="step" value="3" />
    <input type="hidden" name="industry" value="{{ answers.industry }}" />
    <label for="state">Preferred U.S. state?</label>
    <select name="state" id="state">
      <option value="">Any state</option>
      {% for s in states %}
      <option value="{{ s }}" {% if s="" ="answers.state" %}selected{% endif %}>
        {{ s }}
      </option>
      {% endfor %}
    </select>
    <button type="submit">Next</button>
  </form>

  {% elif step == 3 %}
  <form method="get" action="{{ url_for('quiz') }}">
    <input type="hidden" name="step" value="4" />
    <input type="hidden" name="industry" value="{{ answers.industry }}" />
    <input type="hidden" name="state" value="{{ answers.state }}" />
    <label for="city">City (optional)</label>
    <input
      type="text"
      name="city"
      id="city"
      value="{{ answers.city }}"
      placeholder="e.g. ATLANTA"
    />
    <button type="submit">Next</button>
  </form>

  {% elif step == 4 %}
  <form method="get" action="{{ url_for('dashboard') }}">
    <input type="hidden" name="industry" value="{{ answers.industry }}" />
    <input type="hidden" name="state" value="{{ answers.state }}" />
    <input type="hidden" name="city" value="{{ answers.city }}" />
    <label>What size employer are you looking for?</label>
    {% for value, label in [ ('small', 'Small (< 50 approvals)'), ('mid', 'Mid
    (50 - 500)'), ('large', 'Large (500+)'), ('any', 'Any'), ] %}
    <label>
      <input
        type="radio"
        name="size"
        value="{{ value }}"
        {%
        if
        value=""
        ="answers.size"
        or
        (not
        answers.size
        and
        value=""
        ="any"
        )
        %}checked{%
        endif
        %}
      />
      {{ label }}
    </label>
    {% endfor %}
    <button type="submit">Show my matches</button>
  </form>
  {% endif %}
</section>
{% endblock %}

Optional — get help from your agent:

Here is my quiz.html. Keep the form logic and the hidden-input
passthrough exactly as-is. Help me style the 4 steps so they look
like a friendly onboarding flow — progress indicator (Step 2 of 4),
bigger labels, nicer buttons. Use my existing style.css.

Phase 5: Build the dashboard

Mixed phase. The /dashboard route is thin (reads filters, calls three search functions, renders). The template is mostly agent-assisted — Chart.js is a lot of boilerplate you don't need to hand-type.

Objective

A /dashboard route that shows a stats strip, two Chart.js bar charts, and the results table — all driven by the quiz answers in the URL.

Instructions

Add a /dashboard route in app.py that reads industry, state, city, size from request.args
Build a filters dict and call totals(filters), top_employers(filters, 10), approvals_by_state(filters), and search_employers(filters, 50)
Render dashboard.html passing all four results plus the raw filters dict
In dashboard.html, include Chart.js via CDN and render two bar charts: Top 10 employers and Approvals by state (or by city, if a state is already picked)
Add a "Refine search" link back to /quiz?step=1

Hints

app.py:

from search import (
    search_employers, totals, top_employers, approvals_by_state,
)


@app.route("/dashboard")
def dashboard():
    filters = {
        "industry": request.args.get("industry", ""),
        "state":    request.args.get("state", ""),
        "city":     request.args.get("city", ""),
        "size":     request.args.get("size", ""),
    }
    return render_template(
        "dashboard.html",
        title="OPT Pal — Your matches",
        filters=filters,
        stats=totals(filters),
        top=top_employers(filters, 10),
        by_state=approvals_by_state(filters),
        rows=search_employers(filters, 50),
    )

templates/dashboard.html:

{% extends "base.html" %} {% block content %}
<section class="panel">
  <a href="{{ url_for('quiz', step=1) }}">&larr; Refine search</a>

  <h1>Your matches</h1>

  <div class="stat-strip">
    <div>
      <strong>{{ stats.employers }}</strong>
      employers
    </div>
    <div>
      <strong>{{ "{:,}".format(stats.approvals) }}</strong>
      total approvals
    </div>
    <div class="meta">
      Filters: {% if filters.industry %}Industry "{{ filters.industry }}"{%
      endif %} {% if filters.state %} · State {{ filters.state }}{% endif %} {%
      if filters.city %} · City {{ filters.city }}{% endif %} {% if filters.size
      %} · Size {{ filters.size }}{% endif %}
    </div>
  </div>

  <canvas id="top-employers-chart" height="120"></canvas>
  <canvas id="by-state-chart" height="120"></canvas>

  <script id="top-data" type="application/json">
    {{ top|tojson }}
  </script>
  <script id="state-data" type="application/json">
    {{ by_state|tojson }}
  </script>
</section>

<section class="panel">
  <h2>All matches ({{ rows|length }} shown)</h2>
  <table>
    <thead>
      <tr>
        <th>Employer</th>
        <th>Industry</th>
        <th>City</th>
        <th>State</th>
        <th>Approvals</th>
      </tr>
    </thead>
    <tbody>
      {% for r in rows %}
      <tr>
        <td>{{ r.employer_name }}</td>
        <td>{{ r.industry }}</td>
        <td>{{ r.city }}</td>
        <td>{{ r.state }}</td>
        <td>{{ r.total_approvals }}</td>
      </tr>
      {% endfor %}
    </tbody>
  </table>
</section>

<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
<script>
  const top = JSON.parse(document.getElementById("top-data").textContent)
  const byState = JSON.parse(document.getElementById("state-data").textContent)

  new Chart(document.getElementById("top-employers-chart"), {
    type: "bar",
    data: {
      labels: top.map((r) => r.employer_name),
      datasets: [
        { label: "Total approvals", data: top.map((r) => r.total_approvals) },
      ],
    },
    options: {
      plugins: { title: { display: true, text: "Top 10 employers" } },
    },
  })

  new Chart(document.getElementById("by-state-chart"), {
    type: "bar",
    data: {
      labels: byState.map((r) => r.state),
      datasets: [
        {
          label: "Total approvals",
          data: byState.map((r) => r.total_approvals),
        },
      ],
    },
    options: {
      plugins: { title: { display: true, text: "Approvals by state" } },
    },
  })
</script>
{% endblock %}

Why |tojson inside a <script type="application/json"> tag? Jinja safely serializes Python lists/dicts to a JSON string. Putting it inside an inert application/json script tag (not text/javascript) means it isn't executed — we read it with JSON.parse below. That's safer than writing const top = {{ top|tojson }}; which has XSS quirks.

Optional — get help from your agent (after your route + data shape are working):

My /dashboard route returns stats, top employers, approvals by
state, and a rows list. Here is dashboard.html. Help me polish
the Chart.js config: consistent colors, rotated x-axis labels so
long employer names don't overlap, and a legend off. Don't change
my route or the JSON embedded in the template.

Phase 6: Spec + journal + README

Handwrite this yourself. This is your voice on demo day.

Objective

Update the spec to match the new shape, fill in the empty journal, and give the README a short description of what OPT Pal does.

Instructions

Edit project.spec.md: rewrite the MVP list to be quiz onboarding → filtered dashboard with 2 charts. Move "sort", "export", and "compare" to stretch
Fill the Checkpoint 1 section of project.journal.md — a short catch-up paragraph on the data cleaning + SQLite import work you already shipped
Fill the Checkpoint 2 section of project.journal.md — what's in search.py (handwritten), what's in the dashboard template (agent-assisted), anything still rough
Update README.md — one paragraph + a code block with how to run
Commit and push

Hints

Spec MVP rewrite (cut half the bullets from your current spec):

**Must have (MVP):**

- [ ] **Quiz onboarding:** 4-step wizard collecting industry, state, optional city, and company-size preference
- [ ] **Filtered dashboard:** stats strip (employer count + total approvals), top 10 employers bar chart, approvals-by-state bar chart, and a results table
- [ ] **All filters compose:** every step narrows the same SQL query in `search.py`
- [ ] **Provenance and disclaimer:** USCIS source, fiscal year(s), and a plain-language note that historical data doesn't predict individual outcomes

Commit message idea:

checkpoint 2: search.py + quiz wizard + dashboard with Chart.js

Phase 7 (preview — not required for Checkpoint 2): Plotly US state map

After CP2 ships, the approvals_by_state(filters) function you already wrote becomes the data source for a US state choropleth. Plotly.js loads from a CDN, renders SVG, no npm required.

The sketch (for next week, don't build now):

New route /api/state-approvals.json that returns jsonify(approvals_by_state(filters))
Dashboard adds a <div id="us-map"> and a Plotly choropleth trace with locationmode: "USA-states", locations: states.map(r => r.state), z: states.map(r => r.total_approvals)
Same filter dict flows through — the map just visualizes what the bar chart already shows

Phase 8 (stretch — demo polish): three-globe or Mapbox

If the state map works smoothly and you have time before May 2, this is the demo-day centerpiece. Pick one of the following:

Stay 2D: add a second Chart.js line chart of approvals by fiscal_year, and a "compare two states" side-by-side view. Safer, still impressive.
Go 3D: swap the state map for a three-globe globe with arcs from each employer's state/city to a "you are here" point, or a Mapbox GL JS globe projection with city markers. Needs a Mapbox token for the latter.

Pick the 3D path only if the dashboard MVP is rock solid. Don't chase the globe at the cost of a working quiz.

Checkpoint 2 Readiness

By Thursday April 23 at 3pm:

search.py exists and does not import flask
search.py has search_employers, totals, top_employers, approvals_by_state, unique_industries, unique_states
helpers.py is cleaning-only (no search functions)
/quiz wizard with 4 steps works end-to-end, passing answers forward via hidden inputs
/dashboard renders stats strip + 2 Chart.js bar charts + results table
project.spec.md MVP reframed as quiz → dashboard
Checkpoint 1 + Checkpoint 2 entries in project.journal.md
README.md updated
Committed and pushed