Reproducible Experiments

Seeds, environments, and provenance — make your science survive contact with the future

A scientific result you cannot reproduce six months later is not science — it is folklore. Reproducibility is the discipline of making your computational experiments behave the same way every time, on every machine, for every collaborator, forever.

This chapter is unusual: it has fewer equations and more hygiene. The payoff is that everything you build will still work after you forget how it works.

What can go wrong

Any one of these can make a number change in the third decimal — or change a conclusion entirely.

The four pillars

Pin the environment. Lock package versions in a manifest.
Seed the randomness. Make every stochastic step deterministic.
Record inputs and outputs. Know what data went in and what came out.
Version everything. Code, configs, and small data go in git; large data goes in content-addressed storage.

Pillar 1: Pin the environment

The minimum acceptable manifest is a requirements.txt produced by pip freeze. Even better, use a lockfile tool that captures transitive dependencies exactly — pip-tools, poetry, uv, or conda-lock.

A typical pyproject.toml snippet for a scientific project:

Three lessons:

Exact pins (==) for everything you actually use, not just ranges
A lockfile that pins transitive dependencies too
The Python version itself, including the patch
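
The first and third lessons can be checked mechanically. What follows is a minimal sketch, assuming requirements are written one per line in pip's `name==version` form. `find_unpinned` and `python_version_string` are hypothetical helpers, not part of pip or any lockfile tool, and the parsing is deliberately simplistic (it ignores extras, environment markers, and URLs). The package names and versions in `example` are illustrative only.

```python
import sys


def find_unpinned(requirement_lines):
    """Return requirement lines that do not use an exact '==' pin."""
    unpinned = []
    for raw in requirement_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # Ranges such as '>=', '~=' or a bare name contain no '=='.
        if "==" not in line:
            unpinned.append(line)
    return unpinned


def python_version_string():
    """Full interpreter version including the patch level."""
    return ".".join(str(part) for part in sys.version_info[:3])


# Hypothetical requirements, for illustration only.
example = ["numpy==1.26.4", "scipy>=1.10", "matplotlib", "# a comment"]
print("not exactly pinned:", find_unpinned(example))
print("python:", python_version_string())
```

A check like this could run in continuous integration so that a range or a bare package name never slips into the manifest unnoticed.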

Pillar 2: Seed everything stochastic

NumPy's default_rng makes this clean: create one generator at the top of the experiment, thread it through everywhere.
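
A minimal sketch of that threading pattern is given here. The function names (`simulate_noise`, `pick_indices`, `run_experiment`) and the seed value are hypothetical, chosen only for illustration; the NumPy calls are the standard `default_rng`, `Generator.normal`, and `Generator.integers`.

```python
import numpy as np


def simulate_noise(rng, n):
    """Draw n standard-normal samples from the generator passed in."""
    return rng.normal(size=n)


def pick_indices(rng, n, upper):
    """Draw n integers in [0, upper) from the same generator."""
    return rng.integers(0, upper, size=n)


def run_experiment(seed, n=5):
    # One generator, created once, passed explicitly to every stochastic step.
    rng = np.random.default_rng(seed)
    noise = simulate_noise(rng, n)
    indices = pick_indices(rng, n, upper=10)
    return noise, indices


first = run_experiment(seed=12345)
second = run_experiment(seed=12345)
# Same seed, same call order, same numbers.
print(np.array_equal(first[0], second[0]), np.array_equal(first[1], second[1]))
```

Because every function takes `rng` as an argument, nothing depends on hidden module-level state, and a test can pass in its own generator.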

Two anti-patterns to avoid:

Never use the global np.random.* functions in research code. They share hidden state and make ordering matter in invisible ways.
Never seed with time.time(). That is the opposite of reproducibility.

When using multiple parallel workers, derive child seeds with SeedSequence so each worker has a deterministic stream:
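
One way this might look, as a sketch: `worker_mean` and `run_workers` are hypothetical names, and the plain loop with an `order` argument stands in for whatever pool or scheduler actually runs the workers. `SeedSequence.spawn` and `default_rng` are the real NumPy APIs.

```python
import numpy as np


def worker_mean(child_seed, n=1000):
    """Hypothetical worker: mean of n normal draws from its own stream."""
    rng = np.random.default_rng(child_seed)
    return float(rng.normal(size=n).mean())


def run_workers(root_seed, n_workers=4, order=None):
    # spawn() derives one child sequence per worker from the root seed;
    # child i is the same on every run that uses the same root seed.
    children = np.random.SeedSequence(root_seed).spawn(n_workers)
    order = range(n_workers) if order is None else order
    results = {}
    for i in order:  # execution order stands in for pool scheduling
        results[i] = worker_mean(children[i])
    return [results[i] for i in range(n_workers)]


in_order = run_workers(2024)
shuffled = run_workers(2024, order=[3, 1, 0, 2])
print(in_order == shuffled)
```

Each worker's result is tied to its index, not to when it happened to run, so changing the execution order leaves the collected results unchanged.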

Each worker's stream depends only on the root seed and the worker's index, so every worker is reproducible, the whole run is reproducible, and the order in which workers execute doesn't change the answer.

Pillar 3: Record inputs and outputs

A reproducible experiment carries a manifest — a small text file that says exactly what was run with what parameters and what came out. Hash large inputs so you'd notice if they changed.
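
A manifest writer could be sketched as follows. The field names (`created_utc`, `params`, `inputs`, `outputs`) and the helper names are assumptions made for illustration, not a standard schema; only the standard library is used.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(params, input_paths, outputs):
    """Collect what was run, on which inputs, and what came out."""
    return {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "params": params,
        "inputs": {str(p): sha256_of_file(p) for p in input_paths},
        "outputs": outputs,
    }


def write_manifest(manifest, path="manifest.json"):
    """Write the manifest as sorted, indented JSON so it diffs cleanly."""
    Path(path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
```

A call such as `write_manifest(build_manifest({"alpha": 0.1}, ["data/raw.csv"], {"rmse": 0.42}))` would record the run; the parameter name, path, and value in that call are placeholders.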

Drop this manifest.json alongside every result. Years later, you can verify that the inputs were the ones you think they were and re-run with the same parameters.

Pillar 4: Version your code

Use git. Tag the commit that produced every published result:

git commit -am "fig-3 simulation, ran with config v2"
git tag fig-3-v2

Record the commit hash in the manifest. This pairs the exact code with the exact outputs forever.
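
Capturing the hash can be automated. The sketch below assumes the `git` executable is on the PATH and that the script runs inside a repository; `current_commit` and `has_uncommitted_changes` are hypothetical helper names, and the `"unknown"` and `None` fallbacks are one possible convention for when git is unavailable.

```python
import subprocess


def current_commit(repo_dir="."):
    """Return the full hash of HEAD, or 'unknown' if git cannot provide it."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return result.stdout.strip()


def has_uncommitted_changes(repo_dir="."):
    """True if the working tree is dirty, None if git cannot tell."""
    try:
        result = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return bool(result.stdout.strip())


print("commit:", current_commit())
print("dirty:", has_uncommitted_changes())
```

Recording the dirty flag next to the hash matters: a hash taken from a working tree with uncommitted edits does not identify the code that actually ran.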

Parameter sweeps the disciplined way

Most experimental work is a parameter sweep: vary one or two inputs, record the outputs, plot. Resist the temptation to do this with copy-paste cells in a notebook. Instead, separate configuration from code, and let a small driver loop over configs.
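
A minimal sketch of such a driver, with configs held as plain dictionaries: `run_one`, `make_configs`, `run_sweep`, and the parameters `amplitude`, `noise`, `n_samples` are all hypothetical, standing in for whatever the real experiment computes.

```python
import numpy as np


def run_one(config):
    """Hypothetical experiment: mean of a noisy constant signal."""
    rng = np.random.default_rng(config["seed"])
    samples = config["amplitude"] + config["noise"] * rng.normal(
        size=config["n_samples"]
    )
    return {"config": config, "mean": float(samples.mean())}


def make_configs():
    """Build the grid of configs; each one carries its own seed."""
    grid = [(a, s) for a in (0.5, 1.0) for s in (0.1, 0.5)]
    return [
        {"amplitude": a, "noise": s, "n_samples": 1000, "seed": 1000 + i}
        for i, (a, s) in enumerate(grid)
    ]


def run_sweep(configs):
    """The driver: no experiment logic here, just a loop over configs."""
    return [run_one(config) for config in configs]


for row in run_sweep(make_configs()):
    print(row["config"]["amplitude"], row["config"]["noise"], row["mean"])
```

Because `run_one` reads everything it needs, including the seed, from its config, each result is independent of which other configs ran and in what order.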

This pattern scales: change configs to a list of YAML files, swap the loop for joblib.Parallel, and you have a production-grade sweep that still gives identical results to the single-threaded version, provided each run draws its randomness only from a seed stored in its own config.

Notebooks vs scripts

Notebooks are wonderful for exploration and terrible for reproducibility: cell order is implicit, hidden state lurks everywhere, and the JSON format is murder to diff in git.

A practical rule:

Notebook for figuring out what to compute and looking at results
Module for computing it, with one entry point per experiment
Notebook at the end imports the module and displays the results

The "library + thin notebook" pattern keeps the reproducible parts of your work in plain .py files you can test, version, and reuse.
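
The module half of that pattern might be sketched like this. The experiment itself (`run`, a seeded random walk) and the flag names `--seed` and `--n-steps` are hypothetical. In a real script, `main()` would typically be called under an `if __name__ == "__main__":` guard; here it is called with an explicit argument list so the example is self-contained.

```python
import argparse

import numpy as np


def run(seed, n_steps):
    """Hypothetical experiment: endpoint of a seeded random walk."""
    rng = np.random.default_rng(seed)
    return float(rng.normal(size=n_steps).sum())


def main(argv=None):
    """Single entry point: parse parameters, run once, report the result."""
    parser = argparse.ArgumentParser(description="Run the experiment once.")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--n-steps", type=int, default=100)
    args = parser.parse_args(argv)
    result = run(args.seed, args.n_steps)
    print(result)
    return result


# From the command line this would be the script's arguments; a notebook
# would instead import the module and call run(...) directly.
main(["--seed", "7", "--n-steps", "50"])
```

The notebook then contains only the import, the call, and the plotting code, so the computation itself stays testable and diffable.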

A reproducibility scorecard

Rate your project on these axes — every project should aim for at least 4/5:

Anyone can clone the repo and run python run_experiment.py with no manual steps
Dependency versions are pinned in a lockfile
Every random source is seeded
Inputs are either committed or hashed in the manifest
Every result file links back to a git commit

If you can tick all five, your work stands a good chance of reproducing in 2030 on a machine that does not yet exist.

Check your understanding

Question (select one)

Your colleague ships you a script that calls np.random.normal(...) directly. You run it twice in a row and get different results. What is the cleanest fix?

Set np.random.seed(...) once at the top of the script

Refactor the script to create a single rng = np.random.default_rng(seed) at the top and pass it to every function that needs randomness

Run the script in a fresh Python interpreter each time

Catch the difference and average it away

Question (select one)

You publish a paper whose key figure was produced by a notebook with 47 cells. Six months later a referee asks for a small variation. What is the most reproducibility-friendly response?

Email the notebook and ask the referee to figure it out

Maintain the figure's computation in a versioned .py module with a single entry function, parametrized by the variation; re-run with the new parameter, regenerate the figure, and tag the commit

Re-run all 47 cells and hope they still work

Ask the referee to install your old laptop
