Thu's Project Guide
Project: OPT Pal — H-1B Employer Data Hub Category: Web App (Flask) + Data Science Last updated: April 21
Note: This guide reflects the latest state of your project repo. It may not match the most up-to-date version if you've worked since.
Where You Are
Big jump from last week — real scaffolding is in the repo now:
helpers.pyloads the USCIS CSV, cleans it (renames the long Excel headers to short codes, drops blank employers, coerces counts to numbers), and writes it into SQLite viaimport_data_into_SQL.py. That's legit data-engineering work.app.pyhas one route that reads?industry=and callssearch_employers_by_industry, which runs a parameterizedLIKEquery againstemployersin SQLite.templates/home.htmlrenders a table of results with a styled search form.- Your
project.spec.mdis written and already names charts + compare + multi-year as stretch goals.
What's missing for the pivot we talked about: the app today is a single text filter. The plan now is to reshape it into a short onboarding quiz that narrows the search, then a dashboard with charts and (eventually) a US map. That's a much better demo-day story for "OPT Pal."
Two small things to fix in passing:
search_employers_by_industrycurrently lives inhelpers.pynext to cleaning. Search is business logic — it belongs insearch.pyon its own.project.journal.mdis still the template. Checkpoint 1 + Checkpoint 2 entries need to go in by Thursday.
Project Structure
Your project splits into two kinds of code:
- Business logic — you handwrite this. Everything in
search.py(NEW): the filter function the quiz drives, and the aggregate functions the dashboard reads. These ARE your app — the decisions about how someone narrows 12k employers and what numbers they see when they land on the dashboard. - Library / view code — agent-assisted is fine. Flask routes in
app.py(they just read query params and pass dicts to templates), Jinja templates, Chart.js / Plotly.js snippets, CSS. - Keep as-is →
helpers.pystays as your cleaning module (load_raw,clean_employers,load_clean). No search logic in there.
Target layout by Thursday:
final-project-huynth4/
├── app.py ← Flask routes — agent-assisted OK
├── helpers.py ← cleaning only — already done
├── search.py ← business logic — handwrite (NEW, yours to own)
├── import_data_into_SQL.py ← already done
├── pyproject.toml
├── templates/
│ ├── base.html
│ ├── quiz.html ← NEW — the onboarding wizard
│ └── dashboard.html ← NEW — charts + stats + results
├── static/
│ └── style.css
├── instance/
│ └── opt_pal.db
└── data/
└── Employer Information.csv
Quiz flow
To setup your quiz, we'll create a sequence of URL routes that will guide the user through a series of questions leading to a dashboard with helpful information.
Home "/" → /quiz?step=1 → /quiz?step=2 → /quiz?step=3 → /quiz?step=4 → /dashboard
industry state city (opt) company size charts + stats + results table
Each quiz step passes the prior answers forward as hidden inputs in the form, so state lives in the URL — no Flask session needed.
Phase 1: Split cleaning vs search — create search.py
Handwrite this yourself.
Objective
Move search_employers_by_industry out of helpers.py into a new search.py. helpers.py becomes cleaning-only.
Instructions
Hints
Top of search.py:
"""Search + aggregate functions for OPT Pal.
All SQLite queries live here. No Flask imports.
"""
from pathlib import Path
import sqlite3
DB_PATH = Path(__file__).resolve().parent / "instance" / "opt_pal.db"
def _connect():
conn = sqlite3.connect(DB_PATH)
conn.row_factory = sqlite3.Row
return conn
The _connect() helper means every function in search.py opens the DB the same way. The leading underscore is a Python convention for "private to this module."
Optional — get help from your agent:
Walk me through the difference between helpers.py (cleaning) and search.py (queries). Why separate them? Don't change my code — just explain the reasoning.
Phase 2: Generalize the filter — search_employers(filters)
Write most of this yourself.
Objective
Replace search_employers_by_industry(user_input) with search_employers(filters) that accepts any combination of industry, state, city, and size and builds one parameterized SQL query.
Instructions
Hints
search.py — fill in the blanks yourself:
APPROVAL_COLS = [
"new_employment_approval",
"continuation_approval",
"change_employer_approval",
"amended_approval",
]
TOTAL_APPROVALS_SQL = " + ".join(APPROVAL_COLS)
SIZE_RANGES = {
"small": (0, 50),
"mid": (50, 500),
"large": (500, 10_000_000),
}
def search_employers(filters, limit=50):
clauses, params = [], []
if filters.get("industry"):
clauses.append("industry LIKE ?")
params.append(f"%{filters['industry'].strip()}%")
if filters.get("state"):
clauses.append("state = ?")
params.append(filters["state"].strip().upper())
if filters.get("city"):
clauses.append("city LIKE ?")
params.append(f"%{filters['city'].strip().upper()}%")
size = (filters.get("size") or "").lower()
if size in SIZE_RANGES:
lo, hi = SIZE_RANGES[size]
clauses.append(f"({TOTAL_APPROVALS_SQL}) BETWEEN ? AND ?")
params.extend([lo, hi])
where = " AND ".join(clauses) if clauses else "1=1"
query = f"""
SELECT employer_name, industry, city, state, fiscal_year,
({TOTAL_APPROVALS_SQL}) AS total_approvals
FROM employers
WHERE {where}
ORDER BY total_approvals DESC, employer_name ASC
LIMIT ?
"""
params.append(limit)
with _connect() as conn:
rows = conn.execute(query, params).fetchall()
return [dict(r) for r in rows]
def unique_industries():
with _connect() as conn:
rows = conn.execute(
"SELECT DISTINCT industry FROM employers "
"WHERE industry != '' ORDER BY industry"
).fetchall()
return [r["industry"] for r in rows]
def unique_states():
with _connect() as conn:
rows = conn.execute(
"SELECT DISTINCT state FROM employers "
"WHERE state != '' ORDER BY state"
).fetchall()
return [r["state"] for r in rows]
Why build the query as a string + params list? Each filter is optional. You can't know the shape of the WHERE clause in advance. Appending to clauses / params together keeps the ? placeholders and the values in sync — that's what keeps the query safe from SQL injection.
Why TOTAL_APPROVALS_SQL as a constant? The four approval columns show up in both the filter (for size) and the SELECT (for sorting). Defining the sum once means you can't forget to update one when you change the other.
Optional — get help from your agent:
Here is my search_employers function. Explain how the clauses and params lists work together to build a safe parameterized query. What would go wrong if I used f-strings to inject user input directly into the SQL? Don't change my code.
Phase 3: Aggregates for the dashboard
Handwrite this yourself. These three functions are what the dashboard charts read. Same filter shape as Phase 2 so they compose.
Objective
Add totals(filters), top_employers(filters, n=10), and approvals_by_state(filters) to search.py. All three take the same filters dict as search_employers.
Instructions
Hints
Extract the WHERE-builder once so you aren't repeating the Phase 2 logic four times:
def _build_where(filters):
"""Return (where_sql, params) for the given filter dict."""
clauses, params = [], []
# ... exactly the same logic from search_employers, but return it ...
where = " AND ".join(clauses) if clauses else "1=1"
return where, params
Then search_employers, totals, top_employers, and approvals_by_state all call _build_where(filters) at the top.
totals:
def totals(filters):
where, params = _build_where(filters)
query = f"""
SELECT COUNT(*) AS employers,
COALESCE(SUM({TOTAL_APPROVALS_SQL}), 0) AS approvals
FROM employers
WHERE {where}
"""
with _connect() as conn:
row = conn.execute(query, params).fetchone()
return {"employers": row["employers"], "approvals": int(row["approvals"])}
top_employers and approvals_by_state follow the same pattern — SELECT ... WHERE {where} GROUP BY ... ORDER BY ... LIMIT ?. You write those.
Optional — get help from your agent:
Walk me through why I pulled the WHERE-building logic into _build_where() instead of copy-pasting it. What does this buy me when I add a 5th filter later? Don't change my code.
Phase 4: Build the quiz flow
Objective
A 4-step onboarding wizard at /quiz?step=N. Each step asks one question and carries the prior answers forward via hidden inputs. Step 4's submit goes to /dashboard.
The 4 questions
- Industry — dropdown populated from
unique_industries() - State — dropdown from
unique_states(), with an "Any state" option - City — optional text input
- Company size — radio buttons: Small (< 50 approvals), Mid (50–500), Large (500+), Any
Instructions
Hints
app.py — quiz route:
from flask import Flask, render_template, request, redirect, url_for
from search import unique_industries, unique_states
app = Flask(__name__)
@app.route("/")
def home():
return redirect(url_for("quiz", step=1))
@app.route("/quiz")
def quiz():
step = int(request.args.get("step", 1))
answers = {
"industry": request.args.get("industry", ""),
"state": request.args.get("state", ""),
"city": request.args.get("city", ""),
"size": request.args.get("size", ""),
}
return render_template(
"quiz.html",
title="OPT Pal",
step=step,
answers=answers,
industries=unique_industries() if step == 1 else [],
states=unique_states() if step == 2 else [],
)
templates/quiz.html — the passthrough pattern (this is the key trick):
{% extends "base.html" %} {% block content %}
<section class="panel">
<h1>Let's narrow down your search</h1>
<p class="meta">Step {{ step }} of 4</p>
{% if step == 1 %}
<form method="get" action="{{ url_for('quiz') }}">
<input type="hidden" name="step" value="2" />
<label for="industry">What industry are you targeting?</label>
<select name="industry" id="industry">
{% for opt in industries %}
<option
value="{{ opt }}"
{%
if
opt=""
="answers.industry"
%}selected{%
endif
%}
>
{{ opt }}
</option>
{% endfor %}
</select>
<button type="submit">Next</button>
</form>
{% elif step == 2 %}
<form method="get" action="{{ url_for('quiz') }}">
<input type="hidden" name="step" value="3" />
<input type="hidden" name="industry" value="{{ answers.industry }}" />
<label for="state">Preferred U.S. state?</label>
<select name="state" id="state">
<option value="">Any state</option>
{% for s in states %}
<option value="{{ s }}" {% if s="" ="answers.state" %}selected{% endif %}>
{{ s }}
</option>
{% endfor %}
</select>
<button type="submit">Next</button>
</form>
{% elif step == 3 %}
<form method="get" action="{{ url_for('quiz') }}">
<input type="hidden" name="step" value="4" />
<input type="hidden" name="industry" value="{{ answers.industry }}" />
<input type="hidden" name="state" value="{{ answers.state }}" />
<label for="city">City (optional)</label>
<input
type="text"
name="city"
id="city"
value="{{ answers.city }}"
placeholder="e.g. ATLANTA"
/>
<button type="submit">Next</button>
</form>
{% elif step == 4 %}
<form method="get" action="{{ url_for('dashboard') }}">
<input type="hidden" name="industry" value="{{ answers.industry }}" />
<input type="hidden" name="state" value="{{ answers.state }}" />
<input type="hidden" name="city" value="{{ answers.city }}" />
<label>What size employer are you looking for?</label>
{% for value, label in [ ('small', 'Small (< 50 approvals)'), ('mid', 'Mid
(50 - 500)'), ('large', 'Large (500+)'), ('any', 'Any'), ] %}
<label>
<input
type="radio"
name="size"
value="{{ value }}"
{%
if
value=""
="answers.size"
or
(not
answers.size
and
value=""
="any"
)
%}checked{%
endif
%}
/>
{{ label }}
</label>
{% endfor %}
<button type="submit">Show my matches</button>
</form>
{% endif %}
</section>
{% endblock %}
Optional — get help from your agent:
Here is my quiz.html. Keep the form logic and the hidden-input passthrough exactly as-is. Help me style the 4 steps so they look like a friendly onboarding flow — progress indicator (Step 2 of 4), bigger labels, nicer buttons. Use my existing style.css.
Phase 5: Build the dashboard
Mixed phase. The
/dashboardroute is thin (reads filters, calls three search functions, renders). The template is mostly agent-assisted — Chart.js is a lot of boilerplate you don't need to hand-type.
Objective
A /dashboard route that shows a stats strip, two Chart.js bar charts, and the results table — all driven by the quiz answers in the URL.
Instructions
Hints
app.py:
from search import (
search_employers, totals, top_employers, approvals_by_state,
)
@app.route("/dashboard")
def dashboard():
filters = {
"industry": request.args.get("industry", ""),
"state": request.args.get("state", ""),
"city": request.args.get("city", ""),
"size": request.args.get("size", ""),
}
return render_template(
"dashboard.html",
title="OPT Pal — Your matches",
filters=filters,
stats=totals(filters),
top=top_employers(filters, 10),
by_state=approvals_by_state(filters),
rows=search_employers(filters, 50),
)
templates/dashboard.html:
{% extends "base.html" %} {% block content %}
<section class="panel">
<a href="{{ url_for('quiz', step=1) }}">← Refine search</a>
<h1>Your matches</h1>
<div class="stat-strip">
<div>
<strong>{{ stats.employers }}</strong>
employers
</div>
<div>
<strong>{{ "{:,}".format(stats.approvals) }}</strong>
total approvals
</div>
<div class="meta">
Filters: {% if filters.industry %}Industry "{{ filters.industry }}"{%
endif %} {% if filters.state %} · State {{ filters.state }}{% endif %} {%
if filters.city %} · City {{ filters.city }}{% endif %} {% if filters.size
%} · Size {{ filters.size }}{% endif %}
</div>
</div>
<canvas id="top-employers-chart" height="120"></canvas>
<canvas id="by-state-chart" height="120"></canvas>
<script id="top-data" type="application/json">
{{ top|tojson }}
</script>
<script id="state-data" type="application/json">
{{ by_state|tojson }}
</script>
</section>
<section class="panel">
<h2>All matches ({{ rows|length }} shown)</h2>
<table>
<thead>
<tr>
<th>Employer</th>
<th>Industry</th>
<th>City</th>
<th>State</th>
<th>Approvals</th>
</tr>
</thead>
<tbody>
{% for r in rows %}
<tr>
<td>{{ r.employer_name }}</td>
<td>{{ r.industry }}</td>
<td>{{ r.city }}</td>
<td>{{ r.state }}</td>
<td>{{ r.total_approvals }}</td>
</tr>
{% endfor %}
</tbody>
</table>
</section>
<script src="https://cdn.jsdelivr.net/npm/chart.js@4.4.0/dist/chart.umd.min.js"></script>
<script>
const top = JSON.parse(document.getElementById("top-data").textContent)
const byState = JSON.parse(document.getElementById("state-data").textContent)
new Chart(document.getElementById("top-employers-chart"), {
type: "bar",
data: {
labels: top.map((r) => r.employer_name),
datasets: [
{ label: "Total approvals", data: top.map((r) => r.total_approvals) },
],
},
options: {
plugins: { title: { display: true, text: "Top 10 employers" } },
},
})
new Chart(document.getElementById("by-state-chart"), {
type: "bar",
data: {
labels: byState.map((r) => r.state),
datasets: [
{
label: "Total approvals",
data: byState.map((r) => r.total_approvals),
},
],
},
options: {
plugins: { title: { display: true, text: "Approvals by state" } },
},
})
</script>
{% endblock %}
Why |tojson inside a <script type="application/json"> tag? Jinja safely serializes Python lists/dicts to a JSON string. Putting it inside an inert application/json script tag (not text/javascript) means it isn't executed — we read it with JSON.parse below. That's safer than writing const top = {{ top|tojson }}; which has XSS quirks.
Optional — get help from your agent (after your route + data shape are working):
My /dashboard route returns stats, top employers, approvals by state, and a rows list. Here is dashboard.html. Help me polish the Chart.js config: consistent colors, rotated x-axis labels so long employer names don't overlap, and a legend off. Don't change my route or the JSON embedded in the template.
Phase 6: Spec + journal + README
Handwrite this yourself. This is your voice on demo day.
Objective
Update the spec to match the new shape, fill in the empty journal, and give the README a short description of what OPT Pal does.
Instructions
Hints
Spec MVP rewrite (cut half the bullets from your current spec):
**Must have (MVP):**
- [ ] **Quiz onboarding:** 4-step wizard collecting industry, state, optional city, and company-size preference
- [ ] **Filtered dashboard:** stats strip (employer count + total approvals), top 10 employers bar chart, approvals-by-state bar chart, and a results table
- [ ] **All filters compose:** every step narrows the same SQL query in `search.py`
- [ ] **Provenance and disclaimer:** USCIS source, fiscal year(s), and a plain-language note that historical data doesn't predict individual outcomes
Commit message idea:
checkpoint 2: search.py + quiz wizard + dashboard with Chart.js
Phase 7 (preview — not required for Checkpoint 2): Plotly US state map
After CP2 ships, the approvals_by_state(filters) function you already wrote becomes the data source for a US state choropleth. Plotly.js loads from a CDN, renders SVG, no npm required.
The sketch (for next week, don't build now):
- New route
/api/state-approvals.jsonthat returnsjsonify(approvals_by_state(filters)) - Dashboard adds a
<div id="us-map">and a Plotlychoroplethtrace withlocationmode: "USA-states",locations: states.map(r => r.state),z: states.map(r => r.total_approvals) - Same filter dict flows through — the map just visualizes what the bar chart already shows
Phase 8 (stretch — demo polish): three-globe or Mapbox
If the state map works smoothly and you have time before May 2, this is the demo-day centerpiece. Pick one of the following:
- Stay 2D: add a second Chart.js line chart of approvals by
fiscal_year, and a "compare two states" side-by-side view. Safer, still impressive. - Go 3D: swap the state map for a three-globe globe with arcs from each employer's state/city to a "you are here" point, or a Mapbox GL JS globe projection with city markers. Needs a Mapbox token for the latter.
Pick the 3D path only if the dashboard MVP is rock solid. Don't chase the globe at the cost of a working quiz.