Data Science Setup Guide

Data science projects let you explore real-world datasets, find patterns, and create visualizations that tell a story. Python is one of the most popular languages for this work, and the two libraries this guide uses, Pandas and Matplotlib, are beginner-friendly once you learn the basics.

If you have never worked with data analysis before, that is okay. You do not need statistics knowledge. You need a dataset you find interesting and curiosity about what is in it.

Quick Setup

If you have not used uv before, review the Packages and uv guide first.

Open your terminal, navigate to your project folder, and run:

uv init
uv add pandas matplotlib

Create a file called analysis.py:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/sample.csv")

print(df.head())
print(df.describe())

Run it:

uv run python analysis.py

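This loads a CSV file, prints the first five rows, and prints summary statistics (count, mean, standard deviation, min, max, and quartiles) for the numeric columns. The script stops with a FileNotFoundError until there is an actual CSV file at data/sample.csv. See "Finding Data" below.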
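If you would rather see a friendly message than a traceback while you are still looking for a dataset, one option is to check for the file before loading it. This is a minimal sketch, not part of the setup above; the function name load_dataset is a hypothetical helper introduced here for illustration.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Hypothetical helper: load a CSV, or return None if the file is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        print(f"No file at {csv_path}. Download a CSV and place it there first.")
        return None
    return pd.read_csv(csv_path)


df = load_dataset("data/sample.csv")
if df is not None:
    print(df.head())
```

Returning None keeps the sketch short. In a larger project you might prefer to let the error surface so that a missing file is never silently ignored.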

Optional: Jupyter Notebooks

Some people prefer working in notebooks where you can run code one cell at a time and see charts inline. To use Jupyter:

uv add jupyter
uv run jupyter notebook

This opens a browser window where you can create .ipynb notebook files. Notebooks are great for exploration, but your final project should also have a regular Python script that someone can run from the terminal.

Key Concepts

DataFrames are the core data structure in Pandas. Think of a DataFrame as a spreadsheet — rows and columns with labels. When you load a CSV file, each row becomes a row in the DataFrame and each column becomes a named column.

df = pd.read_csv("data/movies.csv")

df.head()             # first 5 rows
df.shape              # (num_rows, num_columns)
df.columns            # column labels (an Index, not a plain list)
df["title"]           # one column, returned as a Series
df[df["rating"] > 8]  # only the rows where rating is above 8
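To try these operations before you have a dataset, here is a small sketch that builds a DataFrame in memory. The titles and ratings are made-up sample values standing in for data/movies.csv, not real data.

```python
import pandas as pd

# Hypothetical sample data standing in for data/movies.csv
df = pd.DataFrame(
    {
        "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
        "rating": [7.2, 8.5, 9.1, 6.4],
    }
)

print(df.shape)          # (4, 2): 4 rows, 2 columns
print(list(df.columns))  # ['title', 'rating']

# The comparison produces a True/False value per row;
# indexing with it keeps only the rows marked True
high_rated = df[df["rating"] > 8]
print(high_rated["title"].tolist())  # ['Movie B', 'Movie C']
```

The filter returns a new DataFrame and leaves df unchanged, so you can apply several different filters to the same loaded data.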

Matplotlib creates charts and graphs. The basic pattern is: prepare your data, call a plot function, then show or save it.

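import os
import matplotlib.pyplot as plt

os.makedirs("output", exist_ok=True)  # savefig fails if the folder is missing

df["rating"].hist()
plt.title("Distribution of Ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.savefig("output/ratings.png")  # save before show(); the figure may be blank afterwards
plt.show()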
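The same prepare, plot, save pattern can also be written with Matplotlib's Figure and Axes objects, which is the style the official tutorials use. The sketch below is one possible version: the ratings are made-up values, the file name ratings_sketch.png is arbitrary, and the "Agg" backend is chosen here only so the script writes a file without opening a window.

```python
from pathlib import Path

import matplotlib

matplotlib.use("Agg")  # non-interactive backend: write files, no window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ratings standing in for a real column
ratings = pd.Series([6.4, 7.2, 7.9, 8.5, 8.8, 9.1], name="rating")

# savefig does not create missing folders, so make output/ first
out_dir = Path("output")
out_dir.mkdir(exist_ok=True)

fig, ax = plt.subplots()
ax.hist(ratings, bins=5)
ax.set_title("Distribution of Ratings")
ax.set_xlabel("Rating")
ax.set_ylabel("Count")

out_path = out_dir / "ratings_sketch.png"
fig.savefig(out_path)
plt.close(fig)  # release the figure once it is saved
```

Working with fig and ax directly makes it clearer which chart you are changing once a script produces more than one.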

Finding Data

The most important decision for a data science project is picking a good dataset. Here is what to look for:

CSV format — easiest to work with in Pandas
Under 10 MB — keeps things fast and simple
A topic you care about — you will spend weeks with this data, so pick something you find interesting

Where to find datasets:

Kaggle Datasets — thousands of free datasets on every topic. Search, filter by size and format, and download as CSV.
FiveThirtyEight Data — datasets behind FiveThirtyEight articles (sports, politics, culture)
Data.gov — US government open data (weather, education, health)
Awesome Public Datasets — a curated list organized by topic

Download a CSV and put it in your data/ folder. Then load it with pd.read_csv("data/your_file.csv").
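A useful first look at any new dataset is the column types and the number of missing values. The sketch below uses a few made-up rows held in memory so that it runs on its own; the column names city, year, and avg_temp are hypothetical. With a real download you would pass the file path to pd.read_csv instead.

```python
import io

import pandas as pd

# Stand-in for a downloaded file. With a real dataset, pass the path instead,
# for example pd.read_csv("data/your_file.csv")
sample_csv = io.StringIO(
    "city,year,avg_temp\n"
    "Oslo,2020,6.9\n"
    "Oslo,2021,\n"
    "Lima,2020,19.4\n"
)
df = pd.read_csv(sample_csv)

print(df.dtypes)           # data type pandas inferred for each column
missing = df.isna().sum()  # number of missing values per column
print(missing)
```

Columns with many missing values, or numbers that were read as text, are usually the first things to clean.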

Recommended Project Structure

my-project/
├── analysis.py         # Main script (load, clean, analyze, visualize)
├── helpers.py          # Your business logic (cleaning functions, calculations)
├── data/               # Raw data files (CSV, JSON)
│   └── dataset.csv
├── output/             # Generated charts and reports
│   └── chart.png
├── pyproject.toml      # Dependencies (created by uv)
└── README.md

Put your data loading and chart creation in analysis.py. Put your custom analysis functions (cleaning, filtering, calculations) in helpers.py. Save charts to output/ so you can include them in your README and presentation.
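As a rough illustration of that split, here is the kind of function that could live in helpers.py, followed by how analysis.py might call it. The function name clean_ratings and the rating column are hypothetical; your own helpers will depend on what is wrong with your data.

```python
import pandas as pd


def clean_ratings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with unparseable or missing ratings removed.

    Hypothetical helper: assumes the data has a "rating" column.
    """
    cleaned = df.copy()
    # Turn values like "n/a" into NaN instead of raising an error
    cleaned["rating"] = pd.to_numeric(cleaned["rating"], errors="coerce")
    # Drop rows that still have no usable rating
    return cleaned.dropna(subset=["rating"]).reset_index(drop=True)


# In analysis.py this would start with: from helpers import clean_ratings
raw = pd.DataFrame({"title": ["A", "B", "C"], "rating": ["8.1", "n/a", "6.5"]})
cleaned = clean_ratings(raw)
print(cleaned)
```

Because the function works on a copy, the raw data stays available for comparison, and the function is easy to test separately from the charts.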

Project Ideas

Here are some ideas that are realistic for a 3-week timeline. Pick a dataset you find interesting and scope it to the simplest version that produces useful insights.

Spotify Listening History Analyzer — Download your Spotify data (or use a public dataset from Kaggle) and analyze your listening habits. Which artists do you listen to most? When do you listen? How has your taste changed? MVP: load the data, show your top 10 artists in a bar chart, and add one time-based trend chart.
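The "top 10 artists" step mostly comes down to counting rows per artist. Here is a minimal sketch, assuming one row per play and a column named artist; both are assumptions, and the column names in a real Spotify export or Kaggle dataset will differ.

```python
import pandas as pd

# Hypothetical listening log: one row per play
plays = pd.DataFrame(
    {
        "artist": [
            "Artist A", "Artist B", "Artist A",
            "Artist C", "Artist A", "Artist B",
        ]
    }
)

# value_counts sorts from most to least frequent; head(10) keeps the top 10
top_artists = plays["artist"].value_counts().head(10)
print(top_artists)
```

The result is a Series, so top_artists.plot(kind="bar") is one way to turn it into the bar chart.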

Sports Stats Explorer — Pick a sport and a public stats dataset. Compare players, teams, or seasons. MVP: load a CSV, calculate a few interesting stats (averages, rankings), and create 2-3 comparison charts.

Weather Trends Visualizer — Use historical weather data for a city you care about. Show temperature trends, rainfall patterns, or compare seasons. MVP: load a weather CSV, plot the monthly average temperature for one year, and add a second chart for precipitation.
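For the monthly average, one approach is to parse the dates and group by month. The sketch below uses made-up daily readings; the column names date and temp_c are hypothetical and will not match a real weather file.

```python
import pandas as pd

# Hypothetical daily readings; real column names will differ
weather = pd.DataFrame(
    {
        "date": ["2023-01-01", "2023-01-02", "2023-02-01", "2023-02-02"],
        "temp_c": [2.0, 4.0, 5.0, 7.0],
    }
)

# Parse the text dates so pandas can extract the month
weather["date"] = pd.to_datetime(weather["date"])

# Group rows by month number (1 = January) and average each group
monthly_avg = weather.groupby(weather["date"].dt.month)["temp_c"].mean()
print(monthly_avg)
```

Grouping by month number alone only makes sense for a single year of data. With several years, filter to one year first or group by year and month together.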

Survey or Poll Analyzer — Find a public survey dataset (Kaggle has many) and analyze the responses. Who participated? What were the most common answers? Are there interesting correlations? MVP: load the data, clean it, create 3-4 charts that tell a story about the results.

Tutorials and Resources

Kaggle Learn: Pandas — free interactive course, covers DataFrames, indexing, grouping, and data types with hands-on exercises
Matplotlib Quick Start — official overview of how Matplotlib works (Figures, Axes, plot types, styling)
Matplotlib Pyplot Tutorial — step-by-step tutorial for line plots, scatter plots, formatting, and subplots
Real Python: Pandas for Data Science — curated learning path from basics through cleaning, visualization, and grouping

Agent Prompts

Use these prompts with your coding agent (Cursor, Claude Code, etc.) to get started. Copy and paste them, then modify to fit your project.

Explore your dataset:

Read my project.spec.md. I have a CSV file at data/[filename].csv. Load it with Pandas, show me the first few rows, list the columns and their data types, and check for missing values. Explain what each column likely represents and suggest 3-4 interesting questions I could answer with this data.

Clean and prepare your data:

Look at my analysis.py and the dataset in data/. The data has some issues — [describe problems like missing values, weird formats, etc.]. Help me write a cleaning function in helpers.py that fixes these issues. Explain each step so I understand what the cleaning does and why.

Create a visualization:

I want to create a bar chart showing [what you want to visualize] from my dataset. Look at my analysis.py and show me how to use Matplotlib to create this chart with a title, axis labels, and clean formatting. Save it to output/ as a PNG. Explain the Matplotlib functions you use.

Tell a story with your data:

Look at my analysis.py and the charts I have created so far. Help me think about what story my data tells. What are the most interesting findings? Suggest 1-2 more charts that would strengthen the narrative. Help me write a summary paragraph I can put in my README.