> Source URL: /unit-3/resources/data-science-setup.guide
# Data Science Setup Guide

Data science projects let you explore real-world datasets, find patterns, and create visualizations that tell a story. Python is one of the most popular languages for this, and the core libraries — Pandas and Matplotlib — are beginner-friendly once you learn the basics.

If you have never worked with data analysis before, that is okay. You do not need statistics knowledge. You need a dataset you find interesting and curiosity about what is in it.

## Quick Setup

If you have not used `uv` before, review the [Packages and uv](../../resources/packages.guide.md) guide first.

Open your terminal, navigate to your project folder, and run:

```bash
uv init
uv add pandas matplotlib
```

Create a file called `analysis.py`:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data/sample.csv")

print(df.head())
print(df.describe())
```

Run it:

```bash
uv run python analysis.py
```

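This loads a CSV file, prints the first five rows, and shows summary statistics for the numeric columns. The script raises a `FileNotFoundError` until a real CSV exists at `data/sample.csv`, so put a file there or change the path in `read_csv`. See "Finding Data" below for where to get one.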

## Optional: Jupyter Notebooks

Some people prefer working in notebooks where you can run code one cell at a time and see charts inline. To use Jupyter:

```bash
uv add jupyter
uv run jupyter notebook
```

This opens a browser window where you can create `.ipynb` notebook files. Notebooks are great for exploration, but your final project should also have a regular Python script that someone can run from the terminal.

## Key Concepts

**DataFrames** are the core data structure in Pandas. Think of a DataFrame as a spreadsheet — rows and columns with labels. When you load a CSV file, each row becomes a row in the DataFrame and each column becomes a named column.

```python
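import pandas as pd

df = pd.read_csv("data/movies.csv")

# In a notebook, each expression displays its result.
# In a script, wrap it in print() to see the output.
df.head()            # first 5 rows
df.shape             # (num_rows, num_columns)
df.columns           # column names (a pandas Index)
df["title"]          # one column (a Series)
df[df["rating"] > 8] # only rows where rating is above 8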
```
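If you do not have a `movies.csv` yet, you can try the same operations on a small DataFrame built in memory. This is a minimal sketch: the titles and ratings are made up for illustration and are not from a real dataset.

```python
import pandas as pd

# Hypothetical data standing in for data/movies.csv
movies = pd.DataFrame(
    {
        "title": ["Alpha", "Bravo", "Charlie", "Delta"],
        "rating": [8.4, 6.1, 9.0, 7.5],
    }
)

print(movies.shape)          # (4, 2)
print(list(movies.columns))  # ['title', 'rating']

# Keep only the rows where the rating is above 8
high_rated = movies[movies["rating"] > 8]
print(high_rated["title"].tolist())  # ['Alpha', 'Charlie']
```

The filter works in two steps: `movies["rating"] > 8` produces a column of True/False values, and indexing the DataFrame with that column keeps the rows marked True.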

**Matplotlib** creates charts and graphs. The basic pattern is: prepare your data, call a plot function, then show or save it.

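```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("data/movies.csv")

# savefig raises an error if the output/ folder does not exist yet
Path("output").mkdir(exist_ok=True)

df["rating"].hist()
plt.title("Distribution of Ratings")
plt.xlabel("Rating")
plt.ylabel("Count")
plt.savefig("output/ratings.png")  # save before show(), or the file may be blank
plt.show()
```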

## Finding Data

The most important decision for a data science project is picking a good dataset. Here is what to look for:

- **CSV format** — easiest to work with in Pandas
- **Under 10 MB** — keeps things fast and simple
- **A topic you care about** — you will spend weeks with this data, so pick something you find interesting

**Where to find datasets:**

- [Kaggle Datasets](https://www.kaggle.com/datasets) — thousands of free datasets on every topic. Search, filter by size and format, and download as CSV.
- [FiveThirtyEight Data](https://github.com/fivethirtyeight/data) — datasets behind FiveThirtyEight articles (sports, politics, culture)
- [Data.gov](https://data.gov/) — US government open data (weather, education, health)
- [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets) — a curated list organized by topic

Download a CSV and put it in your `data/` folder. Then load it with `pd.read_csv("data/your_file.csv")`.
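Before committing to a dataset, it can help to check its column types and how much data is missing. The sketch below assumes a tiny made-up CSV (the `city`, `temp_c`, and `rain_mm` columns are hypothetical) and reads it from a string so it runs without a file. With a real download, you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV contents; two values are intentionally left blank
csv_text = """city,temp_c,rain_mm
Austin,31.5,0.0
Boston,,12.4
Denver,18.2,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Data type pandas inferred for each column
print(df.dtypes)

# Number of missing values in each column
missing = df.isna().sum()
print(missing)
```

A column with many missing values, or a numeric column that pandas read as text, is a sign the dataset will need cleaning before you can chart it.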

## Recommended Project Structure

```
my-project/
├── analysis.py         # Main script (load, clean, analyze, visualize)
├── helpers.py          # Your business logic (cleaning functions, calculations)
├── data/               # Raw data files (CSV, JSON)
│   └── dataset.csv
├── output/             # Generated charts and reports
│   └── chart.png
├── pyproject.toml      # Dependencies (created by uv)
└── README.md
```

Put your data loading and chart creation in `analysis.py`. Put your custom analysis functions (cleaning, filtering, calculations) in `helpers.py`. Save charts to `output/` so you can include them in your README and presentation.
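As a rough illustration of what might live in `helpers.py`, here is one possible cleaning function. The function name `clean_movies` and the `title` and `rating` columns are assumptions for this sketch; your own helpers will depend on your dataset. The last few lines show how `analysis.py` could call it.

```python
import pandas as pd


def clean_movies(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: trimmed titles, numeric ratings, no missing ratings."""
    cleaned = df.copy()  # leave the caller's DataFrame untouched
    cleaned["title"] = cleaned["title"].str.strip()
    # Values that cannot be parsed as numbers (like "n/a") become NaN
    cleaned["rating"] = pd.to_numeric(cleaned["rating"], errors="coerce")
    cleaned = cleaned.dropna(subset=["rating"])
    return cleaned.reset_index(drop=True)


# Hypothetical messy data standing in for a real CSV
raw = pd.DataFrame(
    {
        "title": ["  Alpha ", "Bravo", "Charlie"],
        "rating": ["8.4", "n/a", "9.0"],
    }
)

tidy = clean_movies(raw)
print(tidy)
```

Keeping the cleaning in a function that takes a DataFrame and returns a new one makes it easy to try on small made-up inputs like `raw` before running it on the full dataset.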

## Project Ideas

Here are some ideas that are realistic for a 3-week timeline. Pick a dataset you find interesting and scope it to the simplest version that produces useful insights.

**Spotify Listening History Analyzer** — Download your Spotify data (or use a public dataset from Kaggle) and analyze your listening habits. What artists do you listen to most? When do you listen? How has your taste changed? *MVP: load the data, show top 10 artists in a bar chart, and one time-based trend chart.*

**Sports Stats Explorer** — Pick a sport and a public stats dataset. Compare players, teams, or seasons. *MVP: load a CSV, calculate a few interesting stats (averages, rankings), and create 2-3 comparison charts.*

**Weather Trends Visualizer** — Use historical weather data for a city you care about. Show temperature trends, rainfall patterns, or compare seasons. *MVP: load weather CSV, plot monthly average temperature for one year, add a second chart for precipitation.*

**Survey or Poll Analyzer** — Find a public survey dataset (Kaggle has many) and analyze the responses. Who participated? What were the most common answers? Are there interesting correlations? *MVP: load the data, clean it, create 3-4 charts that tell a story about the results.*
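Several of the MVPs above come down to counting and ranking, for example the top 10 artists chart in the Spotify idea. The sketch below shows one way to do that with `value_counts` and a bar chart. The `artist` column and the play data are made up for illustration, and a real listening history export will likely use different column names.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical listening history: one row per play
plays = pd.DataFrame({"artist": ["A", "B", "A", "C", "A", "B"]})

# Count plays per artist, sorted from most to least, and keep up to 10
top_artists = plays["artist"].value_counts().head(10)
print(top_artists)

# Make sure the output folder exists before saving
Path("output").mkdir(exist_ok=True)

ax = top_artists.plot(kind="bar")
ax.set_title("Most Played Artists")
ax.set_xlabel("Artist")
ax.set_ylabel("Plays")
plt.tight_layout()
plt.savefig("output/top_artists.png")
plt.close()
```

The same pattern applies to the survey idea: swap `artist` for an answer column to chart the most common responses.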

## Tutorials and Resources

- [Kaggle Learn: Pandas](https://www.kaggle.com/learn/pandas) — free interactive course, covers DataFrames, indexing, grouping, and data types with hands-on exercises
- [Matplotlib Quick Start](https://matplotlib.org/stable/users/explain/quick_start.html) — official overview of how Matplotlib works (Figures, Axes, plot types, styling)
- [Matplotlib Pyplot Tutorial](https://matplotlib.org/stable/tutorials/pyplot.html) — step-by-step tutorial for line plots, scatter plots, formatting, and subplots
- [Real Python: Pandas for Data Science](https://realpython.com/learning-paths/pandas-data-science/) — curated learning path from basics through cleaning, visualization, and grouping

## Agent Prompts

Use these prompts with your coding agent (Cursor, Claude Code, etc.) to get started. Copy and paste them, then modify to fit your project.

**Explore your dataset:**

```text
Read my project.spec.md. I have a CSV file at data/[filename].csv. Load it with Pandas, show me the first few rows, list the columns and their data types, and check for missing values. Explain what each column likely represents and suggest 3-4 interesting questions I could answer with this data.
```

**Clean and prepare your data:**

```text
Look at my analysis.py and the dataset in data/. The data has some issues — [describe problems like missing values, weird formats, etc.]. Help me write a cleaning function in helpers.py that fixes these issues. Explain each step so I understand what the cleaning does and why.
```

**Create a visualization:**

```text
I want to create a bar chart showing [what you want to visualize] from my dataset. Look at my analysis.py and show me how to use Matplotlib to create this chart with a title, axis labels, and clean formatting. Save it to output/ as a PNG. Explain the Matplotlib functions you use.
```

**Tell a story with your data:**

```text
Look at my analysis.py and the charts I have created so far. Help me think about what story my data tells. What are the most interesting findings? Suggest 1-2 more charts that would strengthen the narrative. Help me write a summary paragraph I can put in my README.
```


