Project Organization

A six-month-old analysis is, for practical purposes, somebody else's code — even if you wrote it. A small amount of structure goes a long way toward making it survivable.

A starter layout

my-analysis/
├── README.md
├── requirements.txt
├── data/
│   ├── raw/                ← original files, immutable
│   ├── interim/            ← intermediate output
│   └── processed/          ← analysis-ready
├── notebooks/
│   ├── 01-explore.ipynb
│   ├── 02-clean.ipynb
│   └── 03-analysis.ipynb
├── src/
│   ├── load.py
│   ├── clean.py
│   └── plot.py
├── outputs/
│   ├── figures/
│   └── reports/
└── tests/
    └── test_clean.py

You don't need all of this on day one. But if you know where each kind of file belongs, you can add it in the right place the moment you need it.
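If you would rather generate the skeleton than create it by hand, a small script can do it. The following is a minimal sketch, not a standard tool: the function name `scaffold` and the decision to create empty `README.md` and `requirements.txt` files are assumptions made for illustration. The demo runs in a temporary directory so it never touches a real project.

```python
import tempfile
from pathlib import Path

# Directories taken from the starter layout above.
DIRECTORIES = [
    "data/raw",
    "data/interim",
    "data/processed",
    "notebooks",
    "src",
    "outputs/figures",
    "outputs/reports",
    "tests",
]
# Top-level files, created empty if they do not exist yet.
FILES = ["README.md", "requirements.txt"]


def scaffold(root: Path) -> Path:
    """Create the starter layout under `root` without overwriting anything."""
    root_path = Path(root)
    for directory in DIRECTORIES:
        # exist_ok=True makes the call safe to repeat on an existing project.
        (root_path / directory).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() leaves the contents of an existing file unchanged.
        (root_path / name).touch(exist_ok=True)
    return root_path


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    project = scaffold(Path(tmp) / "my-analysis")
    created = sorted(p.relative_to(project).as_posix() for p in project.rglob("*"))

print(created)
```

Because every call uses `exist_ok=True`, running `scaffold` again on an existing project only fills in whatever is missing.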

What goes where

data/raw/ — exactly what you received. Never touched again.
data/interim/ — after parsing, type-fixing, removing obviously-wrong rows.
data/processed/ — the analysis-ready dataset.
notebooks/ — exploration and narrative. Numbered so the reading order is obvious.
src/ — reusable functions imported from notebooks. When a notebook cell grows past ~30 lines or you copy-paste it, promote it to src/.
outputs/ — generated charts and final reports.
tests/ — assertions about your data and helper functions.
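One way to enforce "never touched again" for data/raw/ is to remove write permission from the files once they arrive. This is a sketch under assumptions: `make_read_only` is a hypothetical helper, and file permissions are only a guard rail (an administrator, or anyone who changes the mode back, can still edit the files).

```python
import stat
import tempfile
from pathlib import Path


def make_read_only(raw_dir: Path) -> list[Path]:
    """Remove write permission from every file under raw_dir."""
    locked = []
    for path in sorted(raw_dir.rglob("*")):
        if path.is_file():
            mode = path.stat().st_mode
            # Clear the write bits for owner, group, and others.
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            locked.append(path)
    return locked


# Demo in a throwaway directory with a stand-in raw file.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "data" / "raw"
    raw_dir.mkdir(parents=True)
    sample = raw_dir / "sales_q1_2024.csv"
    sample.write_text("region,amount\nnorth,100\n")

    locked = make_read_only(raw_dir)
    still_writable = bool(sample.stat().st_mode & stat.S_IWUSR)

    # Restore write permission so the temporary directory cleans up on every platform.
    sample.chmod(stat.S_IRUSR | stat.S_IWUSR)

print(len(locked), still_writable)
```

In a real project you would call `make_read_only(Path("data/raw"))` once, right after dropping the original files in place.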

The notebook-naming trick

Prefix notebooks with numbers:

01-explore.ipynb
02-clean.ipynb
03-eda.ipynb
04-models.ipynb

Because file browsers list names in sorted order, the numeric prefix makes the reading order explicit. It also leaves room to insert a step later: 02.5-fix-encoding.ipynb, if you must.
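The ordering is easy to check with plain string sorting, which is roughly what most file listings do. The short sketch below also shows why the leading zero matters: without it, a two-digit prefix sorts ahead of a one-digit one. The `10-report.ipynb` name is made up for the illustration.

```python
notebooks = [
    "03-eda.ipynb",
    "01-explore.ipynb",
    "02.5-fix-encoding.ipynb",
    "04-models.ipynb",
    "02-clean.ipynb",
]

# Plain string sort: "-" sorts before ".", so 02-clean comes before 02.5-fix-encoding.
ordered = sorted(notebooks)
print(ordered)

# Without zero padding, "10-..." sorts before "2-..." because "1" < "2".
unpadded = sorted(["2-clean.ipynb", "10-report.ipynb"])
print(unpadded)
```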

When to graduate to scripts

Move logic from notebooks into scripts when any of these is true:

You re-run the same analysis weekly.
Multiple notebooks need the same cleaning step.
An engineer needs to call your logic from production code.
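When one of those conditions applies, the notebook logic usually ends up behind a small command-line entry point. Here is a minimal sketch using argparse from the standard library. The flag names, the default output path, and the placeholder `run` function are all assumptions; the real work (load, clean, write) would replace the placeholder.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Describe the inputs the weekly run needs."""
    parser = argparse.ArgumentParser(description="Run the cleaning step outside a notebook.")
    parser.add_argument("--raw", type=Path, default=Path("data/raw/sales_q1_2024.csv"))
    parser.add_argument("--out", type=Path, default=Path("data/processed/sales.csv"))
    return parser


def run(raw: Path, out: Path) -> str:
    # Placeholder for the real work: load the raw file, clean it, write the result.
    return f"would clean {raw.as_posix()} -> {out.as_posix()}"


def main(argv=None) -> str:
    # argv=None makes argparse read the real command line; a list is handy for testing.
    args = build_parser().parse_args(argv)
    return run(args.raw, args.out)


# In a script file you would call main() under `if __name__ == "__main__":`.
# Here it is called with an explicit argument list to show the behavior.
message = main(["--raw", "data/raw/jan.csv"])
print(message)
```

Keeping `run` separate from argument parsing means an engineer can import and call it directly, without going through the command line.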

Imports across notebooks and src

When you move helpers into src/clean.py:

# src/clean.py
import pandas as pd

def standardize_names(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df.columns = df.columns.str.lower().str.replace(" ", "_")
    return df

You can import them from any notebook:

# in a notebook
import sys
sys.path.insert(0, "..")   # assumes the notebook runs from notebooks/; a proper package install is more robust
from src.clean import standardize_names

df = standardize_names(df)

This refactor pays dividends as soon as you need the same logic in two places.
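A hard-coded relative path works only when the notebook's working directory is notebooks/. A slightly sturdier approach is to walk upward until the project root is found. This sketch assumes the root is the first ancestor that contains a src/ directory; `find_project_root` is a hypothetical helper, not part of any library.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "src") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"no '{marker}' directory found at or above {start}")


# Demo: build a tiny project in a temporary directory and search from notebooks/.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "my-analysis"
    (project / "src").mkdir(parents=True)
    (project / "notebooks").mkdir()

    found = find_project_root(project / "notebooks")
    root_matches = found == project

print(root_matches)
```

In a notebook, the equivalent of the earlier snippet would be `sys.path.insert(0, str(find_project_root(Path.cwd())))`. Note that an unrelated src/ directory higher up the tree would also match, so a more distinctive marker may be safer in practice.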

A minimal `README.md`

Every project deserves a README, even a small one:

# Q1 Sales Analysis

## What this is
An exploratory analysis of Q1 2024 sales by region and product line.

## How to run
1. Put raw sales export in `data/raw/sales_q1_2024.csv`
2. Install: `pip install -r requirements.txt`
3. Run notebooks in order: 01 → 02 → 03.
4. Final figures land in `outputs/figures/`.

## Key contacts
- Data source: revops@example.com
- Analyst: aiko@example.com

With these few lines, anyone (including future-you) can see how to rerun the project within a minute of opening it.

Naming conventions

Files: lowercase_with_underscores.csv
Variables: lowercase_with_underscores
Functions: lowercase_with_underscores
DataFrames: descriptive (employees, orders, cleaned_orders), not df, df2, df_final.

A name like df_v3_final_REAL_FIXED is a smell. If you need versions, use git.
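If files or column labels arrive with spaces, capitals, and punctuation, a small helper can bring them in line with the convention. This is a sketch only: `to_snake_case` is a hypothetical function, and it treats every character outside ASCII letters and digits as a separator, which may be too aggressive for names in other alphabets.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert a label or file stem to lowercase_with_underscores."""
    # Replace every run of non-alphanumeric characters with a single underscore.
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop underscores left at either end, then lowercase.
    return cleaned.strip("_").lower()


examples = {
    "Q1 Sales Export": to_snake_case("Q1 Sales Export"),
    "Order-ID": to_snake_case("Order-ID"),
    "  Unit Price ($) ": to_snake_case("  Unit Price ($) "),
}
print(examples)
```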

Git — the missing tool

Use version control for analysis projects, just like engineers do:

git init at the project root.
.gitignore excludes data/raw/, data/interim/, and data/processed/ (data should not be in git for size/PII reasons).
Commit frequently, with descriptive messages.

Even a one-person analysis benefits — you can roll back when you break something.

A small `.gitignore`

# Don't commit data
data/raw/
data/interim/
data/processed/

# Python
__pycache__/
*.pyc
.ipynb_checkpoints/

# Generated outputs (optional, divisive: pick a side)
# outputs/

Tests for analysis

You don't need 100% test coverage — but a few critical assertions are gold.

# tests/test_clean.py
import pandas as pd
from src.clean import standardize_names

def test_standardize_names_lowercases():
    df = pd.DataFrame({"First Name": [1], "Last Name": [2]})
    out = standardize_names(df)
    assert list(out.columns) == ["first_name", "last_name"]

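If a change to clean.py breaks the column-name standardization, this test fails and flags the regression. Run it from the project root with python -m pytest, which puts the current directory on the import path so that src is importable.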
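Tests can also assert things about the data itself, not just about helper functions. The sketch below uses only the standard library so it runs anywhere. The column names `order_id` and `amount`, the two rules being checked, and the `find_problems` helper are assumptions chosen for illustration.

```python
import csv
import io


def find_problems(rows: list[dict]) -> list[str]:
    """Return descriptions of rows that break basic expectations."""
    problems = []
    seen_ids = set()
    for line_number, row in enumerate(rows, start=2):  # line 1 is the header
        order_id = row["order_id"]
        if order_id in seen_ids:
            problems.append(f"line {line_number}: duplicate order_id {order_id}")
        seen_ids.add(order_id)
        if float(row["amount"]) < 0:
            problems.append(f"line {line_number}: negative amount {row['amount']}")
    return problems


# Hypothetical sample standing in for a file such as data/processed/orders.csv.
sample = io.StringIO("order_id,amount\n1,10.0\n2,-5.0\n2,7.5\n")
problems = find_problems(list(csv.DictReader(sample)))
print(problems)


def test_clean_orders_have_no_problems():
    # A pytest-style test: clean data should produce an empty problem list.
    clean = io.StringIO("order_id,amount\n1,10.0\n2,7.5\n")
    assert find_problems(list(csv.DictReader(clean))) == []
```

Returning a list of problems rather than failing on the first one makes the test output more useful when a new data delivery breaks several rules at once.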

Documentation lives with the code

Update the README, update notebook markdown cells, update function docstrings — whenever you change the logic, update the words that describe it.

The single greatest cause of "this codebase is a nightmare" is documentation that used to be true.

Check your understanding

Question (select one)

Why number your notebooks (01-load, 02-clean, ...)?

It is required by Jupyter

It makes them faster

It makes the reading and execution order obvious — important when a future reader (or you, in six months) needs to retrace the analysis

It saves memory

Question (select one)

When should logic move out of a notebook and into a .py file in src/?

Never — notebooks are the deliverable

Immediately, always

When the same logic is needed in multiple notebooks, or when a cell grows large enough to be its own well-named function

When the cell crashes

Question (select one)

Why is committing raw data to git generally a bad idea?

Git cannot handle CSV

It is illegal

Data files are often large, change frequently, and may contain PII or confidential information — git is for code, data belongs in dedicated storage

It slows down notebooks

Question (select one)

Which name for a DataFrame is best?

df_FINAL_v3_real

df2

cleaned_orders — a descriptive name that says what is in it

final
