> Source URL: /unit-3/project-paths/thu-h/thu-h-2026-04-21.guide
# Thu's Project Guide

**Project:** OPT Pal — H-1B Employer Data Hub
**Category:** Web App (Flask) + Data Science
**Last updated:** April 21

---

> Note: This guide reflects the latest state of your project repo. It may not match the most up-to-date version if you've worked since.

## Where You Are

Big jump from last week — real scaffolding is in the repo now:

- `helpers.py` loads the USCIS CSV, cleans it (renames the long Excel headers to short codes, drops blank employers, coerces counts to numbers), and writes it into SQLite via `import_data_into_SQL.py`. That's legit data-engineering work.
- `app.py` has one route that reads `?industry=` and calls `search_employers_by_industry`, which runs a parameterized `LIKE` query against `employers` in SQLite.
- `templates/home.html` renders a table of results with a styled search form.
- Your `project.spec.md` is written and already names charts + compare + multi-year as stretch goals.

What's missing for the pivot we talked about: the app today is a single text filter. The plan now is to reshape it into **a short onboarding quiz that narrows the search, then a dashboard with charts and (eventually) a US map.** That's a much better demo-day story for "OPT Pal."

Two small things to fix in passing:

- `search_employers_by_industry` currently lives in `helpers.py` next to cleaning. Search is business logic — it belongs in `search.py` on its own.
- `project.journal.md` is still the template. Checkpoint 1 + Checkpoint 2 entries need to go in by Thursday.

---

## Project Structure

Your project splits into two kinds of code:

- **Business logic — you handwrite this.** Everything in `search.py` (NEW): the filter function the quiz drives, and the aggregate functions the dashboard reads. These ARE your app — the decisions about _how_ someone narrows 12k employers and _what numbers_ they see when they land on the dashboard.
- **Library / view code — agent-assisted is fine.** Flask routes in `app.py` (they just read query params and pass dicts to templates), Jinja templates, Chart.js / Plotly.js snippets, CSS.
- **Keep as-is** → `helpers.py` stays as your cleaning module (`load_raw`, `clean_employers`, `load_clean`). No search logic in there.

Target layout by Thursday:

```
final-project-huynth4/
├── app.py                          ← Flask routes — agent-assisted OK
├── helpers.py                      ← cleaning only — already done
├── search.py                       ← business logic — handwrite (NEW, yours to own)
├── import_data_into_SQL.py         ← already done
├── pyproject.toml
├── templates/
│   ├── base.html
│   ├── quiz.html                   ← NEW — the onboarding wizard
│   └── dashboard.html              ← NEW — charts + stats + results
├── static/
│   └── style.css
├── instance/
│   └── opt_pal.db
└── data/
    └── Employer Information.csv
```

## Quiz flow

To setup your quiz, we'll create a sequence of URL routes that will guide the user through a series of questions leading to a dashboard with helpful information.

```
Home "/"  →  /quiz?step=1  →  /quiz?step=2  →  /quiz?step=3  →  /quiz?step=4  →  /dashboard
             industry         state            city (opt)       company size     charts + stats + results table
```

Each quiz step passes the prior answers forward as hidden inputs in the form, so state lives in the URL — no Flask session needed.

---

## Phase 1: Split cleaning vs search — create `search.py`

> **Handwrite this yourself.**

### Objective

Move `search_employers_by_industry` out of `helpers.py` into a new `search.py`. `helpers.py` becomes cleaning-only.

### Instructions

- [ ] Create `search.py` at the project root
- [ ] Move the `DB_PATH` constant and the `search_employers_by_industry` function from `helpers.py` into `search.py`
- [ ] Delete them from `helpers.py`
- [ ] Update the import at the top of `app.py`: `from search import search_employers_by_industry`
- [ ] Run `uv run flask --app app run --debug` and confirm nothing broke

### Hints

**Top of `search.py`:**

```python
"""Search + aggregate functions for OPT Pal.

All SQLite queries live here. No Flask imports.
"""

from pathlib import Path
import sqlite3

DB_PATH = Path(__file__).resolve().parent / "instance" / "opt_pal.db"


def _connect():
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    return conn
```

The `_connect()` helper means every function in `search.py` opens the DB the same way. The leading underscore is a Python convention for "private to this module."

> **Optional — get help from your agent:**
>
> ```text
> Walk me through the difference between helpers.py (cleaning) and
> search.py (queries). Why separate them? Don't change my code —
> just explain the reasoning.
> ```

---

## Phase 2: Generalize the filter — `search_employers(filters)`

> **Write most of this yourself.**

### Objective

Replace `search_employers_by_industry(user_input)` with `search_employers(filters)` that accepts any combination of `industry`, `state`, `city`, and `size` and builds one parameterized SQL query.

### Instructions

- [ ] In `search.py`, write `search_employers(filters, limit=50)` that returns a list of dicts
- [ ] Support these keys in `filters`: `industry` (substring), `state` (exact, uppercased), `city` (substring, uppercased), `size` (`"small"`, `"mid"`, `"large"`, or `"any"`)
- [ ] Build the `WHERE` clause by appending to a list of clauses and a list of params, then joining with `AND`
- [ ] Add `unique_industries()` and `unique_states()` helpers for the quiz dropdowns
- [ ] Delete the old `search_employers_by_industry` once `search_employers` works
- [ ] Update `app.py` to call the new function

### Hints

**`search.py` — fill in the blanks yourself:**

```python
APPROVAL_COLS = [
    "new_employment_approval",
    "continuation_approval",
    "change_employer_approval",
    "amended_approval",
]
TOTAL_APPROVALS_SQL = " + ".join(APPROVAL_COLS)

SIZE_RANGES = {
    "small": (0, 50),
    "mid":   (50, 500),
    "large": (500, 10_000_000),
}


def search_employers(filters, limit=50):
    clauses, params = [], []

    if filters.get("industry"):
        clauses.append("industry LIKE ?")
        params.append(f"%{filters['industry'].strip()}%")

    if filters.get("state"):
        clauses.append("state = ?")
        params.append(filters["state"].strip().upper())

    if filters.get("city"):
        clauses.append("city LIKE ?")
        params.append(f"%{filters['city'].strip().upper()}%")

    size = (filters.get("size") or "").lower()
    if size in SIZE_RANGES:
        lo, hi = SIZE_RANGES[size]
        clauses.append(f"({TOTAL_APPROVALS_SQL}) BETWEEN ? AND ?")
        params.extend([lo, hi])

    where = " AND ".join(clauses) if clauses else "1=1"

    query = f"""
        SELECT employer_name, industry, city, state, fiscal_year,
               ({TOTAL_APPROVALS_SQL}) AS total_approvals
          FROM employers
         WHERE {where}
         ORDER BY total_approvals DESC, employer_name ASC
         LIMIT ?
    """
    params.append(limit)

    with _connect() as conn:
        rows = conn.execute(query, params).fetchall()
    return [dict(r) for r in rows]


def unique_industries():
    with _connect() as conn:
        rows = conn.execute(
            "SELECT DISTINCT industry FROM employers "
            "WHERE industry != '' ORDER BY industry"
        ).fetchall()
    return [r["industry"] for r in rows]


def unique_states():
    with _connect() as conn:
        rows = conn.execute(
            "SELECT DISTINCT state FROM employers "
            "WHERE state != '' ORDER BY state"
        ).fetchall()
    return [r["state"] for r in rows]
```

**Why build the query as a string + params list?** Each filter is optional. You can't know the shape of the `WHERE` clause in advance. Appending to `clauses` / `params` together keeps the `?` placeholders and the values in sync — that's what keeps the query safe from SQL injection.

**Why `TOTAL_APPROVALS_SQL` as a constant?** The four approval columns show up in both the filter (for size) and the `SELECT` (for sorting). Defining the sum once means you can't forget to update one when you change the other.

> **Optional — get help from your agent:**
>
> ```text
> Here is my search_employers function. Explain how the clauses and
> params lists work together to build a safe parameterized query.
> What would go wrong if I used f-strings to inject user input
> directly into the SQL? Don't change my code.
> ```

---

## Phase 3: Aggregates for the dashboard

> **Handwrite this yourself.** These three functions are what the dashboard charts read. Same filter shape as Phase 2 so they compose.

### Objective

Add `totals(filters)`, `top_employers(filters, n=10)`, and `approvals_by_state(filters)` to `search.py`. All three take the same `filters` dict as `search_employers`.

### Instructions

- [ ] Write `totals(filters)` → returns `{"employers": int, "approvals": int}` for the filtered set
- [ ] Write `top_employers(filters, n=10)` → returns top N employer rows by total approvals
- [ ] Write `approvals_by_state(filters)` → returns `[{"state": "CA", "total_approvals": 12345}, ...]` sorted descending
- [ ] Factor the "build WHERE clause from filters" logic into a small helper so the three aggregate functions don't duplicate it

### Hints

**Extract the WHERE-builder once so you aren't repeating the Phase 2 logic four times:**

```python
def _build_where(filters):
    """Return (where_sql, params) for the given filter dict."""
    clauses, params = [], []
    # ... exactly the same logic from search_employers, but return it ...
    where = " AND ".join(clauses) if clauses else "1=1"
    return where, params
```

Then `search_employers`, `totals`, `top_employers`, and `approvals_by_state` all call `_build_where(filters)` at the top.

**`totals`:**

```python
def totals(filters):
    where, params = _build_where(filters)
    query = f"""
        SELECT COUNT(*) AS employers,
               COALESCE(SUM({TOTAL_APPROVALS_SQL}), 0) AS approvals
          FROM employers
         WHERE {where}
    """
    with _connect() as conn:
        row = conn.execute(query, params).fetchone()
    return {"employers": row["employers"], "approvals": int(row["approvals"])}
```

**`top_employers` and `approvals_by_state`** follow the same pattern — `SELECT ... WHERE {where} GROUP BY ... ORDER BY ... LIMIT ?`. You write those.

> **Optional — get help from your agent:**
>
> ```text
> Walk me through why I pulled the WHERE-building logic into
> _build_where() instead of copy-pasting it. What does this buy me
> when I add a 5th filter later? Don't change my code.
> ```

---

## Phase 4: Build the quiz flow

### Objective

A 4-step onboarding wizard at `/quiz?step=N`. Each step asks one question and carries the prior answers forward via hidden inputs. Step 4's submit goes to `/dashboard`.

### The 4 questions

1. **Industry** — dropdown populated from `unique_industries()`
2. **State** — dropdown from `unique_states()`, with an "Any state" option
3. **City** — optional text input
4. **Company size** — radio buttons: Small (< 50 approvals), Mid (50–500), Large (500+), Any

### Instructions

- [ ] Add a `/quiz` route in `app.py` that reads `step` from `request.args` (default to `1`)
- [ ] Read any prior answers (`industry`, `state`, `city`, `size`) from `request.args` too
- [ ] Render `quiz.html`, passing the step number, the prior answers, and the dropdown options
- [ ] In `quiz.html`, use `{% if step == 1 %} ... {% elif step == 2 %} ...` to show the right question
- [ ] Every form in `quiz.html` has a `method="get"` pointing at `/quiz` for steps 1–3 and `/dashboard` for step 4, plus a hidden `<input>` for each prior answer and for the next step number
- [ ] Update the home route `/` to redirect to `/quiz?step=1` (or just rename `/` to be the quiz)

### Hints

**`app.py` — quiz route:**

```python
from flask import Flask, render_template, request, redirect, url_for
from search import unique_industries, unique_states

app = Flask(__name__)


@app.route("/")
def home():
    return redirect(url_for("quiz", step=1))


@app.route("/quiz")
def quiz():
    step = int(request.args.get("step", 1))
    answers = {
        "industry": request.args.get("industry", ""),
        "state":    request.args.get("state", ""),
        "city":     request.args.get("city", ""),
        "size":     request.args.get("size", ""),
    }
    return render_template(
        "quiz.html",
        title="OPT Pal",
        step=step,
        answers=answers,
        industries=unique_industries() if step == 1 else [],
        states=unique_states() if step == 2 else [],
    )
```

**`templates/quiz.html` — the passthrough pattern (this is the key trick):**

```html
{% extends "base.html" %} {% block content %}
<section class="panel">
  <h1>Let's narrow down your search</h1>
  <p class="meta">Step {{ step }} of 4</p>

  {% if step == 1 %}
  <form method="get" action="{{ url_for('quiz') }}">
    <input type="hidden" name="step" value="2" />
    <label for="industry">What industry are you targeting?</label>
    <select name="industry" id="industry">
      {% for opt in industries %}
      <option
        value="{{ opt }}"
        {%
        if
        opt=""
        ="answers.industry"
        %}selected{%
        endif
        %}
      >
        {{ opt }}
      </option>
      {% endfor %}
    </select>
    <button type="submit">Next</button>
  </form>

  {% elif step == 2 %}
  <form method="get" action="{{ url_for('quiz') }}">
    <input type="hidden" name="step" value="3" />
    <input type="hidden" name="industry" value="{{ answers.industry }}" />
    <label for="state">Preferred U.S. state?</label>
    <select name="state" id="state">
      <option value="">Any state</option>
      {% for s in states %}
      <option value="{{ s }}" {% if s="" ="answers.state" %}selected{% endif %}>
        {{ s }}
      </option>
      {% endfor %}
    </select>
    <button type="submit">Next</button>
  </form>

  {% elif step == 3 %}
  <form method="get" action="{{ url_for('quiz') }}">
    <input type="hidden" name="step" value="4" />
    <input type="hidden" name="industry" value="{{ answers.industry }}" />
    <input type="hidden" name="state" value="{{ answers.state }}" />
    <label for="city">City (optional)</label>
    <input
      type="text"
      name="city"
      id="city"
      value="{{ answers.city }}"
      placeholder="e.g. ATLANTA"
    />
    <button type="submit">Next</button>
  </form>

  {% elif step == 4 %}
  <form method="get" action="{{ url_for('dashboard') }}">
    <input type="hidden" name="industry" value="{{ answers.industry }}" />
    <input type="hidden" name="state" value="{{ answers.state }}" />
    <input type="hidden" name="city" value="{{ answers.city }}" />
    <label>What size employer are you looking for?</label>
    {% for value, label in [ ('small', 'Small (< 50 approvals)'), ('mid', 'Mid
    (50 - 500)'), ('large', 'Large (500+)'), ('any', 'Any'), ] %}
    <label>
      <input
        type="radio"
        name="size"
        value="{{ value }}"
        {%
        if
        value=""
        ="answers.size"
        or
        (not
        answers.size
        and
        value=""
        ="any"
        )
        %}checked{%
        endif
        %}
      />
      {{ label }}
    </label>
    {% endfor %}
    <button type="submit">Show my matches</button>
  </form>
  {% endif %}
</section>
{% endblock %}
```

> **Optional — get help from your agent:**
>
> ```text
> Here is my quiz.html. Keep the form logic and the hidden-input
> passthrough exactly as-is. Help me style the 4 steps so they look
> like a friendly onboarding flow — progress indicator (Step 2 of 4),
> bigger labels, nicer buttons. Use my existing style.css.
> ```

---

## Phase 5: Build the dashboard

> **Mixed phase.** The `/dashboard` route is thin (reads filters, calls three search functions, renders). The template is mostly agent-assisted — Chart.js is a lot of boilerplate you don't need to hand-type.

### Objective

A `/dashboard` route that shows a stats strip, two Chart.js bar charts, and the results table — all driven by the quiz answers in the URL.

### Instructions

- [ ] Add a `/dashboard` route in `app.py` that reads `industry`, `state`, `city`, `size` from `request.args`
- [ ] Build a `filters` dict and call `totals(filters)`, `top_employers(filters, 10)`, `approvals_by_state(filters)`, and `search_employers(filters, 50)`
- [ ] Render `dashboard.html` passing all four results plus the raw `filters` dict
- [ ] In `dashboard.html`, include Chart.js via CDN and render two bar charts: **Top 10 employers** and **Approvals by state** (or by city, if a state is already picked)
- [ ] Add a "Refine search" link back to `/quiz?step=1`

### Hints

**`app.py`:**

```python
from search import (
    search_employers, totals, top_employers, approvals_by_state,
)


@app.route("/dashboard")
def dashboard():
    filters = {
        "industry": request.args.get("industry", ""),
        "state":    request.args.get("state", ""),
        "city":     request.args.get("city", ""),
        "size":     request.args.get("size", ""),
    }
    return render_template(
        "dashboard.html",
        title="OPT Pal — Your matches",
        filters=filters,
        stats=totals(filters),
        top=top_employers(filters, 10),
        by_state=approvals_by_state(filters),
        rows=search_employers(filters, 50),
    )
```

**`templates/dashboard.html`:**

```html
{% extends "base.html" %} {% block content %}
<section class="panel">
  <a href="{{ url_for('quiz', step=1) }}">&larr; Refine search</a>

  <h1>Your matches</h1>

  <div class="stat-strip">
    <div>
      <strong>{{ stats.employers }}</strong>
      employers
    </div>
    <div>
      <strong>{{ "{:,}".format(stats.approvals) }}</strong>
      total approvals
    </div>
    <div class="meta">
      Filters: {% if filters.industry %}Industry "{{ filters.industry }}"{%
      endif %} {% if filters.state %} · State {{ filters.state }}{% endif %} {%
      if filters.city %} · City {{ filters.city }}{% endif %} {% if filters.size
      %} · Size {{ filters.size }}{% endif %}
    </div>
  </div>

  <canvas id="top-employers-chart" height="120"></canvas>
  <canvas id="by-state-chart" height="120"></canvas>

  <script id="top-data" type="application/json">
    {{ top|tojson }}
  </script>
  <script id="state-data" type="application/json">
    {{ by_state|tojson }}
  </script>
</section>

<section class="panel">
  <h2>All matches ({{ rows|length }} shown)</h2>
  <table>
    <thead>
      <tr>
        <th>Employer</th>
        <th>Industry</th>
        <th>City</th>
        <th>State</th>
        <th>Approvals</th>
      </tr>
    </thead>
    <tbody>
      {% for r in rows %}
      <tr>
        <td>{{ r.employer_name }}</td>
        <td>{{ r.industry }}</td>
        <td>{{ r.city }}</td>
        <td>{{ r.state }}</td>
        <td>{{ r.total_approvals }}</td>
      </tr>
      {% endfor %}
    </tbody>
  </table>
</section>

<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
<script>
  const top = JSON.parse(document.getElementById("top-data").textContent)
  const byState = JSON.parse(document.getElementById("state-data").textContent)

  new Chart(document.getElementById("top-employers-chart"), {
    type: "bar",
    data: {
      labels: top.map((r) => r.employer_name),
      datasets: [
        { label: "Total approvals", data: top.map((r) => r.total_approvals) },
      ],
    },
    options: {
      plugins: { title: { display: true, text: "Top 10 employers" } },
    },
  })

  new Chart(document.getElementById("by-state-chart"), {
    type: "bar",
    data: {
      labels: byState.map((r) => r.state),
      datasets: [
        {
          label: "Total approvals",
          data: byState.map((r) => r.total_approvals),
        },
      ],
    },
    options: {
      plugins: { title: { display: true, text: "Approvals by state" } },
    },
  })
</script>
{% endblock %}
```

**Why `|tojson` inside a `<script type="application/json">` tag?** Jinja safely serializes Python lists/dicts to a JSON string. Putting it inside an inert `application/json` script tag (not `text/javascript`) means it isn't executed — we read it with `JSON.parse` below. That's safer than writing `const top = {{ top|tojson }};` which has XSS quirks.

> **Optional — get help from your agent (after your route + data shape are working):**
>
> ```text
> My /dashboard route returns stats, top employers, approvals by
> state, and a rows list. Here is dashboard.html. Help me polish
> the Chart.js config: consistent colors, rotated x-axis labels so
> long employer names don't overlap, and a legend off. Don't change
> my route or the JSON embedded in the template.
> ```

---

## Phase 6: Spec + journal + README

> **Handwrite this yourself.** This is your voice on demo day.

### Objective

Update the spec to match the new shape, fill in the empty journal, and give the README a short description of what OPT Pal does.

### Instructions

- [ ] Edit `project.spec.md`: rewrite the MVP list to be **quiz onboarding → filtered dashboard with 2 charts**. Move "sort", "export", and "compare" to stretch
- [ ] Fill the **Checkpoint 1** section of `project.journal.md` — a short catch-up paragraph on the data cleaning + SQLite import work you already shipped
- [ ] Fill the **Checkpoint 2** section of `project.journal.md` — what's in `search.py` (handwritten), what's in the dashboard template (agent-assisted), anything still rough
- [ ] Update `README.md` — one paragraph + a code block with how to run
- [ ] Commit and push

### Hints

**Spec MVP rewrite (cut half the bullets from your current spec):**

```markdown
**Must have (MVP):**

- [ ] **Quiz onboarding:** 4-step wizard collecting industry, state, optional city, and company-size preference
- [ ] **Filtered dashboard:** stats strip (employer count + total approvals), top 10 employers bar chart, approvals-by-state bar chart, and a results table
- [ ] **All filters compose:** every step narrows the same SQL query in `search.py`
- [ ] **Provenance and disclaimer:** USCIS source, fiscal year(s), and a plain-language note that historical data doesn't predict individual outcomes
```

**Commit message idea:**

```
checkpoint 2: search.py + quiz wizard + dashboard with Chart.js
```

---

## Phase 7 (preview — not required for Checkpoint 2): Plotly US state map

After CP2 ships, the `approvals_by_state(filters)` function you already wrote becomes the data source for a US state choropleth. Plotly.js loads from a CDN, renders SVG, no npm required.

**The sketch** (for next week, don't build now):

- New route `/api/state-approvals.json` that returns `jsonify(approvals_by_state(filters))`
- Dashboard adds a `<div id="us-map">` and a Plotly `choropleth` trace with `locationmode: "USA-states"`, `locations: states.map(r => r.state)`, `z: states.map(r => r.total_approvals)`
- Same filter dict flows through — the map just visualizes what the bar chart already shows

---

## Phase 8 (stretch — demo polish): three-globe or Mapbox

If the state map works smoothly and you have time before May 2, this is the demo-day centerpiece. Pick one of the following:

- **Stay 2D:** add a second Chart.js line chart of approvals by `fiscal_year`, and a "compare two states" side-by-side view. Safer, still impressive.
- **Go 3D:** swap the state map for a [three-globe](https://github.com/vasturiano/three-globe) globe with arcs from each employer's state/city to a "you are here" point, or a [Mapbox GL JS](https://docs.mapbox.com/mapbox-gl-js/) globe projection with city markers. Needs a Mapbox token for the latter.

Pick the 3D path only if the dashboard MVP is rock solid. Don't chase the globe at the cost of a working quiz.

---

## Checkpoint 2 Readiness

By Thursday April 23 at 3pm:

- [ ] `search.py` exists and does **not** import `flask`
- [ ] `search.py` has `search_employers`, `totals`, `top_employers`, `approvals_by_state`, `unique_industries`, `unique_states`
- [ ] `helpers.py` is cleaning-only (no search functions)
- [ ] `/quiz` wizard with 4 steps works end-to-end, passing answers forward via hidden inputs
- [ ] `/dashboard` renders stats strip + 2 Chart.js bar charts + results table
- [ ] `project.spec.md` MVP reframed as quiz → dashboard
- [ ] Checkpoint 1 + Checkpoint 2 entries in `project.journal.md`
- [ ] `README.md` updated
- [ ] Committed and pushed

## Helpful Resources

- [Checkpoint 2 Instructions](../../projects/final-project-checkpoint-2.project.md)
- [Lecture 1: The MVP](../../lectures/01-the-mvp/01-the-mvp.lecture.md)
- [Flask Setup Guide](../../resources/flask-setup.guide.md)
- [Chart.js — getting started](https://www.chartjs.org/docs/latest/getting-started/)
- [Plotly.js choropleth (for Phase 7)](https://plotly.com/javascript/choropleth-maps/)
- [USCIS H-1B Employer Data Hub](https://www.uscis.gov/tools/reports-and-studies/h-1b-employer-data-hub)


---

## Backlinks

The following sources link to this document:

- [April 21 -- Pivot to quiz onboarding + dashboard](/unit-3/project-paths/thu-h/thu-h.path.llm.md)