Reproducible Experiments
Seeds, environments, and provenance — make your science survive contact with the future
A scientific result you cannot reproduce six months later is not science — it is folklore. Reproducibility is the discipline of making your computational experiments behave the same way every time, on every machine, for every collaborator, forever.
This chapter is unusual: it has fewer equations and more hygiene. The payoff is that everything you build will still work after you forget how it works.
What can go wrong
Any one of these can make a number change in the third decimal — or change a conclusion entirely.
The four pillars
- Pin the environment. Lock package versions in a manifest.
- Seed the randomness. Make every stochastic step deterministic.
- Record inputs and outputs. Know what data went in and what came out.
- Version everything. Code, configs, and small data go in git; large data goes in content-addressed storage.
Pillar 1: Pin the environment
The minimum acceptable manifest is a requirements.txt produced
by pip freeze. Even better, use a lockfile tool that captures
transitive dependencies exactly — pip-tools, poetry,
uv, or conda-lock.
A typical pyproject.toml snippet for a scientific project:
Three lessons:
- Exact pins (==) for everything you actually use, not just ranges
- A lockfile that pins transitive dependencies too
- The Python version itself, including the patch
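One way to act on these lessons is to check the pins at the start of a run. The sketch below is only an illustration: the PINS list and the helper names loose_pins and mismatches are hypothetical, and it assumes the pins are plain name==version lines such as pip freeze produces.

```python
import sys
from importlib import metadata

# Hypothetical pins, as they might appear in a requirements.txt or lockfile.
PINS = ["numpy==1.26.4", "scipy==1.13.1", "matplotlib>=3.8"]


def loose_pins(lines):
    """Return requirement lines that are not exact '==' pins."""
    return [line for line in lines if "==" not in line]


def mismatches(lines, get_version=metadata.version):
    """Compare exact pins with installed versions.

    Returns {name: (wanted, found)}; found is None if the package is missing.
    """
    problems = {}
    for line in lines:
        if "==" not in line:
            continue  # ranges are reported by loose_pins instead
        name, wanted = (part.strip() for part in line.split("==", 1))
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


# The interpreter version, including the patch, belongs in the record too.
python_version = ".".join(str(part) for part in sys.version_info[:3])

print("Python:", python_version)
print("Not exactly pinned:", loose_pins(PINS))
print("Installed != pinned:", mismatches(PINS))
```

A check like this does not replace a lockfile; it only makes environment drift visible when the experiment starts.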
Pillar 2: Seed everything stochastic
NumPy's default_rng makes this clean: create one generator at
the top of the experiment, thread it through everywhere.
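A minimal sketch of what threading one generator through could look like. The functions simulate_walk, add_noise, and run are hypothetical stand-ins for real experiment steps, and the seed value is arbitrary.

```python
import numpy as np


def simulate_walk(rng, n_steps):
    """Random walk driven only by the generator that is passed in."""
    steps = rng.choice([-1, 1], size=n_steps)
    return steps.cumsum()


def add_noise(rng, signal, sigma):
    """Add Gaussian noise drawn from the same generator."""
    return signal + rng.normal(0.0, sigma, size=signal.shape)


def run(seed, n_steps=100, sigma=0.1):
    # The only place in the experiment where a generator is created.
    rng = np.random.default_rng(seed)
    walk = simulate_walk(rng, n_steps)
    return add_noise(rng, walk, sigma)


first = run(seed=20240501)
second = run(seed=20240501)
print("Identical across runs:", np.array_equal(first, second))
```

Because every function takes rng as an argument, no step depends on hidden global state, and the seed alone determines the output.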
Two anti-patterns to avoid:
- Never use the global np.random.* functions in research code. They share hidden state and make ordering matter in invisible ways.
- Never seed with time.time(). That is the opposite of reproducibility.
When using multiple parallel workers, derive child seeds with
SeedSequence so each worker has a deterministic stream:
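A sketch of how this might look, with the workers simulated by an ordinary loop rather than real processes. The worker function, the root seed, and the worker count are illustrative assumptions.

```python
import numpy as np


def worker(child_seed, n_samples):
    """Hypothetical worker: builds its own generator from its child seed."""
    rng = np.random.default_rng(child_seed)
    return float(rng.normal(size=n_samples).mean())


def run_workers(root_seed, n_workers, order):
    # spawn() is stateful, so build a fresh SeedSequence for every run.
    children = np.random.SeedSequence(root_seed).spawn(n_workers)
    # Key each result by worker index so execution order cannot matter.
    return {i: worker(children[i], 1000) for i in order}


forward = run_workers(12345, 4, order=range(4))
backward = run_workers(12345, 4, order=reversed(range(4)))
print("Same results in any order:", forward == backward)
```

The child seed for worker i depends only on the root seed and on i, which is why shuffling the execution order leaves each worker's stream untouched.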
Every worker is reproducible, every workflow is reproducible, and re-ordering workers doesn't change the answer.
Pillar 3: Record inputs and outputs
A reproducible experiment carries a manifest — a small text file that says exactly what was run with what parameters and what came out. Hash large inputs so you'd notice if they changed.
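A minimal sketch of a manifest writer, assuming JSON-serializable parameters and outputs. The field names, the helper names sha256_file and write_manifest, and the demo values are illustrative choices, not a standard format.

```python
import hashlib
import json
import platform
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large inputs never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(out_dir, params, input_paths, outputs, git_commit=None):
    """Write manifest.json into out_dir and return its contents."""
    manifest = {
        "created_utc": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "git_commit": git_commit,
        "params": params,
        "inputs": {Path(p).name: sha256_file(p) for p in input_paths},
        "outputs": outputs,
    }
    text = json.dumps(manifest, indent=2, sort_keys=True)
    (Path(out_dir) / "manifest.json").write_text(text)
    return manifest


# Demo in a throwaway directory; the file content and numbers are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "input.csv"
    data.write_bytes(b"hello")
    demo = write_manifest(tmp, {"seed": 12345, "n": 1000}, [data], {"mean": 0.02})
    reloaded = json.loads((Path(tmp) / "manifest.json").read_text())

print(json.dumps(demo, indent=2, sort_keys=True))
```

Keying inputs by file name keeps the demo short; a real project with same-named files in different folders would want relative paths instead.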
Drop this manifest.json alongside every result. Years later,
you can verify that the inputs were the ones you think they were
and re-run with the same parameters.
Pillar 4: Version your code
Use git. Tag the commit that produced every published result:
git commit -am "fig-3 simulation, ran with config v2"
git tag fig-3-v2
Record the commit hash in the manifest. This pairs the exact code with the exact outputs forever.
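Reading the commit hash from inside the experiment could be sketched as follows. The helper names are hypothetical, and the code assumes the git command-line tool is on the PATH; it returns None when git or a repository is unavailable.

```python
import subprocess


def _git(args, repo_dir="."):
    """Run a git command and return its stdout, or None if it fails."""
    try:
        done = subprocess.run(
            ["git", *args],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None  # git missing, not a repository, or no commits yet
    return done.stdout.strip()


def git_commit_hash(repo_dir="."):
    """Full hash of the checked-out commit, or None."""
    return _git(["rev-parse", "HEAD"], repo_dir)


def git_is_dirty(repo_dir="."):
    """True if there are uncommitted changes, None if git is unavailable."""
    status = _git(["status", "--porcelain"], repo_dir)
    return None if status is None else bool(status)


commit = git_commit_hash()
dirty = git_is_dirty()
print("commit:", commit, "| uncommitted changes:", dirty)
```

Recording the dirty flag next to the hash matters because a hash taken with uncommitted changes does not describe the code that actually ran.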
Parameter sweeps the disciplined way
Most experimental work is a parameter sweep: vary one or two inputs, record the outputs, plot. Resist the temptation to do this with copy-paste cells in a notebook. Instead, separate configuration from code, and let a small driver loop over configs.
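A small driver along these lines might look like the sketch below. The experiment itself (estimating the mean of noisy samples) and the config keys are hypothetical; the point is only that configs is plain data and run_experiment depends on nothing but its config.

```python
import numpy as np


def run_experiment(config):
    """Hypothetical experiment: summarize noisy samples drawn per config."""
    rng = np.random.default_rng(config["seed"])  # seed travels with the config
    samples = rng.normal(config["mu"], config["sigma"], size=config["n"])
    return {"mean": float(samples.mean()), "std": float(samples.std(ddof=1))}


# Configuration is data, kept separate from the code that consumes it.
configs = [
    {"mu": 0.0, "sigma": sigma, "n": 1000, "seed": 12345}
    for sigma in (0.5, 1.0, 2.0)
]

# The driver: one result row per config, with parameters stored beside outputs.
results = [{**config, **run_experiment(config)} for config in configs]

for row in results:
    print(row)
```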
This pattern scales: change configs to a list of YAML files,
swap the loop for joblib.Parallel, and you have a
production-grade sweep that still gives results identical to the
single-threaded version, provided each run draws its randomness
only from the seed carried in its own config.
Notebooks vs scripts
Notebooks are wonderful for exploration and terrible for reproducibility: cell order is implicit, hidden state lurks everywhere, and the JSON format is murder to diff in git.
A practical rule:
- Notebook for figuring out what to compute and looking at results
- Module for computing it, with one entry point per experiment
- Notebook at the end imports the module and displays the results
The "library + thin notebook" pattern keeps the reproducible parts
of your work in plain .py files you can test, version, and reuse.
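As an illustration, the module half of that pattern could be as small as the sketch below. The file name experiment.py, the function names, and the command-line flags are assumptions made for the example.

```python
# experiment.py (hypothetical module name)
import argparse

import numpy as np


def run_experiment(seed, n_samples=1000):
    """Single entry point: all computation lives here, not in notebook cells."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(size=n_samples)
    return {"seed": seed, "n_samples": n_samples, "mean": float(samples.mean())}


def main(argv=None):
    """Command-line wrapper so the same code also runs outside a notebook."""
    parser = argparse.ArgumentParser(description="Run one experiment.")
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--n-samples", type=int, default=1000)
    args = parser.parse_args(argv)
    result = run_experiment(args.seed, args.n_samples)
    print(result)
    return result


if __name__ == "__main__":
    main()
```

The closing notebook then shrinks to a few lines: import run_experiment, call it with the parameters of interest, and plot what it returns.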
A reproducibility scorecard
Rate your project on these axes — every project should aim for at least 4/5:
- Anyone can clone the repo and run python run_experiment.py with no manual steps
- Dependency versions are pinned in a lockfile
- Every random source is seeded
- Inputs are either committed or hashed in the manifest
- Every result file links back to a git commit
If you can tick all five, your work will reproduce in 2030 on a machine that does not yet exist.
Check your understanding
Your colleague ships you a script that calls np.random.normal(...) directly. You run it twice in a row and get different results. What is the cleanest fix?
- Set np.random.seed(...) once at the top of the script
- Refactor the script to create a single rng = np.random.default_rng(seed) at the top and pass it to every function that needs randomness
- Run the script in a fresh Python interpreter each time
- Catch the difference and average it away
You publish a paper whose key figure was produced by a notebook with 47 cells. Six months later a referee asks for a small variation. What is the most reproducibility-friendly response?
- Email the notebook and ask the referee to figure it out
- Maintain the figure's computation in a versioned .py module with a single entry function, parametrized by the variation; re-run with the new parameter, regenerate the figure, and tag the commit
- Re-run all 47 cells and hope they still work
- Ask the referee to install your old laptop