Thu's Project Guide

Project: OPT Pal — H-1B Employer Data Hub Category: Web App (Flask) + Data Science Last updated: April 21

Note: This guide reflects the latest state of your project repo. It may not match the most up-to-date version if you've worked since.

Where You Are

Big jump from last week — real scaffolding is in the repo now:

  • helpers.py loads the USCIS CSV, cleans it (renames the long Excel headers to short codes, drops blank employers, coerces counts to numbers), and writes it into SQLite via import_data_into_SQL.py. That's legit data-engineering work.
  • app.py has one route that reads ?industry= and calls search_employers_by_industry, which runs a parameterized LIKE query against employers in SQLite.
  • templates/home.html renders a table of results with a styled search form.
  • Your project.spec.md is written and already names charts + compare + multi-year as stretch goals.

What's missing for the pivot we talked about: the app today is a single text filter. The plan now is to reshape it into a short onboarding quiz that narrows the search, then a dashboard with charts and (eventually) a US map. That's a much better demo-day story for "OPT Pal."

Two small things to fix in passing:

  • search_employers_by_industry currently lives in helpers.py next to cleaning. Search is business logic — it belongs in search.py on its own.
  • project.journal.md is still the template. Checkpoint 1 + Checkpoint 2 entries need to go in by Thursday.

Project Structure

Your project splits into two kinds of code:

  • Business logic — you handwrite this. Everything in search.py (NEW): the filter function the quiz drives, and the aggregate functions the dashboard reads. These ARE your app — the decisions about how someone narrows 12k employers and what numbers they see when they land on the dashboard.
  • Library / view code — agent-assisted is fine. Flask routes in app.py (they just read query params and pass dicts to templates), Jinja templates, Chart.js / Plotly.js snippets, CSS.
  • Keep as-ishelpers.py stays as your cleaning module (load_raw, clean_employers, load_clean). No search logic in there.

Target layout by Thursday:

final-project-huynth4/
├── app.py                          ← Flask routes — agent-assisted OK
├── helpers.py                      ← cleaning only — already done
├── search.py                       ← business logic — handwrite (NEW, yours to own)
├── import_data_into_SQL.py         ← already done
├── pyproject.toml
├── templates/
│   ├── base.html
│   ├── quiz.html                   ← NEW — the onboarding wizard
│   └── dashboard.html              ← NEW — charts + stats + results
├── static/
│   └── style.css
├── instance/
│   └── opt_pal.db
└── data/
    └── Employer Information.csv

Quiz flow

To setup your quiz, we'll create a sequence of URL routes that will guide the user through a series of questions leading to a dashboard with helpful information.

Home "/"  →  /quiz?step=1  →  /quiz?step=2  →  /quiz?step=3  →  /quiz?step=4  →  /dashboard
             industry         state            city (opt)       company size     charts + stats + results table

Each quiz step passes the prior answers forward as hidden inputs in the form, so state lives in the URL — no Flask session needed.

Phase 1: Split cleaning vs search — create search.py

Handwrite this yourself.

Objective

Move search_employers_by_industry out of helpers.py into a new search.py. helpers.py becomes cleaning-only.

Instructions

Hints

Top of search.py:

"""Search + aggregate functions for OPT Pal.

All SQLite queries live here. No Flask imports.
"""

from pathlib import Path
import sqlite3

DB_PATH = Path(__file__).resolve().parent / "instance" / "opt_pal.db"


def _connect():
    conn = sqlite3.connect(DB_PATH)
    conn.row_factory = sqlite3.Row
    return conn

The _connect() helper means every function in search.py opens the DB the same way. The leading underscore is a Python convention for "private to this module."

Optional — get help from your agent:

Walk me through the difference between helpers.py (cleaning) and
search.py (queries). Why separate them? Don't change my code —
just explain the reasoning.

Phase 2: Generalize the filter — search_employers(filters)

Write most of this yourself.

Objective

Replace search_employers_by_industry(user_input) with search_employers(filters) that accepts any combination of industry, state, city, and size and builds one parameterized SQL query.

Instructions

Hints

search.py — fill in the blanks yourself:

APPROVAL_COLS = [
    "new_employment_approval",
    "continuation_approval",
    "change_employer_approval",
    "amended_approval",
]
TOTAL_APPROVALS_SQL = " + ".join(APPROVAL_COLS)

SIZE_RANGES = {
    "small": (0, 50),
    "mid":   (50, 500),
    "large": (500, 10_000_000),
}


def search_employers(filters, limit=50):
    clauses, params = [], []

    if filters.get("industry"):
        clauses.append("industry LIKE ?")
        params.append(f"%{filters['industry'].strip()}%")

    if filters.get("state"):
        clauses.append("state = ?")
        params.append(filters["state"].strip().upper())

    if filters.get("city"):
        clauses.append("city LIKE ?")
        params.append(f"%{filters['city'].strip().upper()}%")

    size = (filters.get("size") or "").lower()
    if size in SIZE_RANGES:
        lo, hi = SIZE_RANGES[size]
        clauses.append(f"({TOTAL_APPROVALS_SQL}) BETWEEN ? AND ?")
        params.extend([lo, hi])

    where = " AND ".join(clauses) if clauses else "1=1"

    query = f"""
        SELECT employer_name, industry, city, state, fiscal_year,
               ({TOTAL_APPROVALS_SQL}) AS total_approvals
          FROM employers
         WHERE {where}
         ORDER BY total_approvals DESC, employer_name ASC
         LIMIT ?
    """
    params.append(limit)

    with _connect() as conn:
        rows = conn.execute(query, params).fetchall()
    return [dict(r) for r in rows]


def unique_industries():
    with _connect() as conn:
        rows = conn.execute(
            "SELECT DISTINCT industry FROM employers "
            "WHERE industry != '' ORDER BY industry"
        ).fetchall()
    return [r["industry"] for r in rows]


def unique_states():
    with _connect() as conn:
        rows = conn.execute(
            "SELECT DISTINCT state FROM employers "
            "WHERE state != '' ORDER BY state"
        ).fetchall()
    return [r["state"] for r in rows]

Why build the query as a string + params list? Each filter is optional. You can't know the shape of the WHERE clause in advance. Appending to clauses / params together keeps the ? placeholders and the values in sync — that's what keeps the query safe from SQL injection.

Why TOTAL_APPROVALS_SQL as a constant? The four approval columns show up in both the filter (for size) and the SELECT (for sorting). Defining the sum once means you can't forget to update one when you change the other.

Optional — get help from your agent:

Here is my search_employers function. Explain how the clauses and
params lists work together to build a safe parameterized query.
What would go wrong if I used f-strings to inject user input
directly into the SQL? Don't change my code.

Phase 3: Aggregates for the dashboard

Handwrite this yourself. These three functions are what the dashboard charts read. Same filter shape as Phase 2 so they compose.

Objective

Add totals(filters), top_employers(filters, n=10), and approvals_by_state(filters) to search.py. All three take the same filters dict as search_employers.

Instructions

Hints

Extract the WHERE-builder once so you aren't repeating the Phase 2 logic four times:

def _build_where(filters):
    """Return (where_sql, params) for the given filter dict."""
    clauses, params = [], []
    # ... exactly the same logic from search_employers, but return it ...
    where = " AND ".join(clauses) if clauses else "1=1"
    return where, params

Then search_employers, totals, top_employers, and approvals_by_state all call _build_where(filters) at the top.

totals:

def totals(filters):
    where, params = _build_where(filters)
    query = f"""
        SELECT COUNT(*) AS employers,
               COALESCE(SUM({TOTAL_APPROVALS_SQL}), 0) AS approvals
          FROM employers
         WHERE {where}
    """
    with _connect() as conn:
        row = conn.execute(query, params).fetchone()
    return {"employers": row["employers"], "approvals": int(row["approvals"])}

top_employers and approvals_by_state follow the same pattern — SELECT ... WHERE {where} GROUP BY ... ORDER BY ... LIMIT ?. You write those.

Optional — get help from your agent:

Walk me through why I pulled the WHERE-building logic into
_build_where() instead of copy-pasting it. What does this buy me
when I add a 5th filter later? Don't change my code.

Phase 4: Build the quiz flow

Objective

A 4-step onboarding wizard at /quiz?step=N. Each step asks one question and carries the prior answers forward via hidden inputs. Step 4's submit goes to /dashboard.

The 4 questions

  1. Industry — dropdown populated from unique_industries()
  2. State — dropdown from unique_states(), with an "Any state" option
  3. City — optional text input
  4. Company size — radio buttons: Small (< 50 approvals), Mid (50–500), Large (500+), Any

Instructions

Hints

app.py — quiz route:

from flask import Flask, render_template, request, redirect, url_for
from search import unique_industries, unique_states

app = Flask(__name__)


@app.route("/")
def home():
    return redirect(url_for("quiz", step=1))


@app.route("/quiz")
def quiz():
    step = int(request.args.get("step", 1))
    answers = {
        "industry": request.args.get("industry", ""),
        "state":    request.args.get("state", ""),
        "city":     request.args.get("city", ""),
        "size":     request.args.get("size", ""),
    }
    return render_template(
        "quiz.html",
        title="OPT Pal",
        step=step,
        answers=answers,
        industries=unique_industries() if step == 1 else [],
        states=unique_states() if step == 2 else [],
    )

templates/quiz.html — the passthrough pattern (this is the key trick):

{% extends "base.html" %} {% block content %}
<section class="panel">
  <h1>Let's narrow down your search</h1>
  <p class="meta">Step {{ step }} of 4</p>

  {% if step == 1 %}
  <form method="get" action="{{ url_for('quiz') }}">
    <input type="hidden" name="step" value="2" />
    <label for="industry">What industry are you targeting?</label>
    <select name="industry" id="industry">
      {% for opt in industries %}
      <option
        value="{{ opt }}"
        {%
        if
        opt=""
        ="answers.industry"
        %}selected{%
        endif
        %}
      >
        {{ opt }}
      </option>
      {% endfor %}
    </select>
    <button type="submit">Next</button>
  </form>

  {% elif step == 2 %}
  <form method="get" action="{{ url_for('quiz') }}">
    <input type="hidden" name="step" value="3" />
    <input type="hidden" name="industry" value="{{ answers.industry }}" />
    <label for="state">Preferred U.S. state?</label>
    <select name="state" id="state">
      <option value="">Any state</option>
      {% for s in states %}
      <option value="{{ s }}" {% if s="" ="answers.state" %}selected{% endif %}>
        {{ s }}
      </option>
      {% endfor %}
    </select>
    <button type="submit">Next</button>
  </form>

  {% elif step == 3 %}
  <form method="get" action="{{ url_for('quiz') }}">
    <input type="hidden" name="step" value="4" />
    <input type="hidden" name="industry" value="{{ answers.industry }}" />
    <input type="hidden" name="state" value="{{ answers.state }}" />
    <label for="city">City (optional)</label>
    <input
      type="text"
      name="city"
      id="city"
      value="{{ answers.city }}"
      placeholder="e.g. ATLANTA"
    />
    <button type="submit">Next</button>
  </form>

  {% elif step == 4 %}
  <form method="get" action="{{ url_for('dashboard') }}">
    <input type="hidden" name="industry" value="{{ answers.industry }}" />
    <input type="hidden" name="state" value="{{ answers.state }}" />
    <input type="hidden" name="city" value="{{ answers.city }}" />
    <label>What size employer are you looking for?</label>
    {% for value, label in [ ('small', 'Small (< 50 approvals)'), ('mid', 'Mid
    (50 - 500)'), ('large', 'Large (500+)'), ('any', 'Any'), ] %}
    <label>
      <input
        type="radio"
        name="size"
        value="{{ value }}"
        {%
        if
        value=""
        ="answers.size"
        or
        (not
        answers.size
        and
        value=""
        ="any"
        )
        %}checked{%
        endif
        %}
      />
      {{ label }}
    </label>
    {% endfor %}
    <button type="submit">Show my matches</button>
  </form>
  {% endif %}
</section>
{% endblock %}

Optional — get help from your agent:

Here is my quiz.html. Keep the form logic and the hidden-input
passthrough exactly as-is. Help me style the 4 steps so they look
like a friendly onboarding flow — progress indicator (Step 2 of 4),
bigger labels, nicer buttons. Use my existing style.css.

Phase 5: Build the dashboard

Mixed phase. The /dashboard route is thin (reads filters, calls three search functions, renders). The template is mostly agent-assisted — Chart.js is a lot of boilerplate you don't need to hand-type.

Objective

A /dashboard route that shows a stats strip, two Chart.js bar charts, and the results table — all driven by the quiz answers in the URL.

Instructions

Hints

app.py:

from search import (
    search_employers, totals, top_employers, approvals_by_state,
)


@app.route("/dashboard")
def dashboard():
    filters = {
        "industry": request.args.get("industry", ""),
        "state":    request.args.get("state", ""),
        "city":     request.args.get("city", ""),
        "size":     request.args.get("size", ""),
    }
    return render_template(
        "dashboard.html",
        title="OPT Pal — Your matches",
        filters=filters,
        stats=totals(filters),
        top=top_employers(filters, 10),
        by_state=approvals_by_state(filters),
        rows=search_employers(filters, 50),
    )

templates/dashboard.html:

{% extends "base.html" %} {% block content %}
<section class="panel">
  <a href="{{ url_for('quiz', step=1) }}">&larr; Refine search</a>

  <h1>Your matches</h1>

  <div class="stat-strip">
    <div>
      <strong>{{ stats.employers }}</strong>
      employers
    </div>
    <div>
      <strong>{{ "{:,}".format(stats.approvals) }}</strong>
      total approvals
    </div>
    <div class="meta">
      Filters: {% if filters.industry %}Industry "{{ filters.industry }}"{%
      endif %} {% if filters.state %} · State {{ filters.state }}{% endif %} {%
      if filters.city %} · City {{ filters.city }}{% endif %} {% if filters.size
      %} · Size {{ filters.size }}{% endif %}
    </div>
  </div>

  <canvas id="top-employers-chart" height="120"></canvas>
  <canvas id="by-state-chart" height="120"></canvas>

  <script id="top-data" type="application/json">
    {{ top|tojson }}
  </script>
  <script id="state-data" type="application/json">
    {{ by_state|tojson }}
  </script>
</section>

<section class="panel">
  <h2>All matches ({{ rows|length }} shown)</h2>
  <table>
    <thead>
      <tr>
        <th>Employer</th>
        <th>Industry</th>
        <th>City</th>
        <th>State</th>
        <th>Approvals</th>
      </tr>
    </thead>
    <tbody>
      {% for r in rows %}
      <tr>
        <td>{{ r.employer_name }}</td>
        <td>{{ r.industry }}</td>
        <td>{{ r.city }}</td>
        <td>{{ r.state }}</td>
        <td>{{ r.total_approvals }}</td>
      </tr>
      {% endfor %}
    </tbody>
  </table>
</section>

<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
<script>
  const top = JSON.parse(document.getElementById("top-data").textContent)
  const byState = JSON.parse(document.getElementById("state-data").textContent)

  new Chart(document.getElementById("top-employers-chart"), {
    type: "bar",
    data: {
      labels: top.map((r) => r.employer_name),
      datasets: [
        { label: "Total approvals", data: top.map((r) => r.total_approvals) },
      ],
    },
    options: {
      plugins: { title: { display: true, text: "Top 10 employers" } },
    },
  })

  new Chart(document.getElementById("by-state-chart"), {
    type: "bar",
    data: {
      labels: byState.map((r) => r.state),
      datasets: [
        {
          label: "Total approvals",
          data: byState.map((r) => r.total_approvals),
        },
      ],
    },
    options: {
      plugins: { title: { display: true, text: "Approvals by state" } },
    },
  })
</script>
{% endblock %}

Why |tojson inside a <script type="application/json"> tag? Jinja safely serializes Python lists/dicts to a JSON string. Putting it inside an inert application/json script tag (not text/javascript) means it isn't executed — we read it with JSON.parse below. That's safer than writing const top = {{ top|tojson }}; which has XSS quirks.

Optional — get help from your agent (after your route + data shape are working):

My /dashboard route returns stats, top employers, approvals by
state, and a rows list. Here is dashboard.html. Help me polish
the Chart.js config: consistent colors, rotated x-axis labels so
long employer names don't overlap, and a legend off. Don't change
my route or the JSON embedded in the template.

Phase 6: Spec + journal + README

Handwrite this yourself. This is your voice on demo day.

Objective

Update the spec to match the new shape, fill in the empty journal, and give the README a short description of what OPT Pal does.

Instructions

Hints

Spec MVP rewrite (cut half the bullets from your current spec):

**Must have (MVP):**

- [ ] **Quiz onboarding:** 4-step wizard collecting industry, state, optional city, and company-size preference
- [ ] **Filtered dashboard:** stats strip (employer count + total approvals), top 10 employers bar chart, approvals-by-state bar chart, and a results table
- [ ] **All filters compose:** every step narrows the same SQL query in `search.py`
- [ ] **Provenance and disclaimer:** USCIS source, fiscal year(s), and a plain-language note that historical data doesn't predict individual outcomes

Commit message idea:

checkpoint 2: search.py + quiz wizard + dashboard with Chart.js

Phase 7 (preview — not required for Checkpoint 2): Plotly US state map

After CP2 ships, the approvals_by_state(filters) function you already wrote becomes the data source for a US state choropleth. Plotly.js loads from a CDN, renders SVG, no npm required.

The sketch (for next week, don't build now):

  • New route /api/state-approvals.json that returns jsonify(approvals_by_state(filters))
  • Dashboard adds a <div id="us-map"> and a Plotly choropleth trace with locationmode: "USA-states", locations: states.map(r => r.state), z: states.map(r => r.total_approvals)
  • Same filter dict flows through — the map just visualizes what the bar chart already shows

Phase 8 (stretch — demo polish): three-globe or Mapbox

If the state map works smoothly and you have time before May 2, this is the demo-day centerpiece. Pick one of the following:

  • Stay 2D: add a second Chart.js line chart of approvals by fiscal_year, and a "compare two states" side-by-side view. Safer, still impressive.
  • Go 3D: swap the state map for a three-globe globe with arcs from each employer's state/city to a "you are here" point, or a Mapbox GL JS globe projection with city markers. Needs a Mapbox token for the latter.

Pick the 3D path only if the dashboard MVP is rock solid. Don't chase the globe at the cost of a working quiz.

Checkpoint 2 Readiness

By Thursday April 23 at 3pm:

Helpful Resources