The Bootstrap

Resampling your own data with replacement to approximate a sampling distribution — getting standard errors and confidence intervals for awkward statistics like the median, with no formulas and no normality assumptions.

In Confidence Intervals every interval leaned on a formula: a known standard error and a critical value from the t or normal distribution. That works beautifully for the mean. But what's the standard error of a median? A trimmed mean? A correlation, a ratio of two quantities, the 90th percentile? For most interesting statistics the textbook formula is ugly, fragile, or simply doesn't exist — and the normality assumptions behind the clean formulas may not hold for your messy data anyway.

The bootstrap is the workaround, and it's one of the most liberating ideas in modern statistics: instead of deriving a formula, you let the computer simulate the sampling distribution by resampling your own data with replacement. With a few lines of NumPy you get a standard error and a confidence interval for almost any statistic — no calculus, no distributional assumptions, no lookup tables.

The core idea: your sample stands in for the population

Recall the fundamental problem from Sampling Distributions: to know how much a statistic wobbles, you'd want to draw many fresh samples from the population and watch the statistic vary. But you can't — you only have one sample, and the population is hidden.

The bootstrap's trick is a bold substitution: treat your sample as if it were the population. Your sample is your single best picture of the population, so drawing new samples from your sample mimics drawing new samples from the population. The only wrinkle is that to keep each resample the same size as the original, you must draw with replacement — the same observation can appear more than once (and some won't appear at all). That's what makes each resample different and recreates the sampling variability.

In the diagram, the dashed arrow is the move you can't make (redraw from the population); every solid arrow is something you can do on your laptop. The spread of the bootstrap distribution estimates the standard error, and its percentiles give a confidence interval directly.

Why 'with replacement' is non-negotiable

If you resampled without replacement at the original size, you'd just shuffle the same values and get the identical statistic every time — zero variability, useless. Sampling with replacement lets some points repeat and others drop out, so each resample is a slightly different "plausible world." That variation across resamples is exactly what mimics drawing fresh samples from the population.
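A quick way to see this for yourself is the sketch below. The six data values and the seed are made up for illustration; the only point is the contrast between replace=False and replace=True.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
data = np.array([3.1, 4.7, 5.0, 6.2, 8.9, 12.4])  # made-up values

# Without replacement at full size: every "resample" is a permutation,
# so an order-independent statistic like the median never changes.
medians_without = [
    np.median(rng.choice(data, size=len(data), replace=False))
    for _ in range(1000)
]

# With replacement: some values repeat and others drop out,
# so the median moves from resample to resample.
medians_with = [
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(1000)
]

print(len(np.unique(medians_without)))  # 1 distinct value
print(len(np.unique(medians_with)))     # several distinct values
```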

The algorithm in plain words

The percentile bootstrap is short enough to state in full:

1. Start with your observed sample of n values.
2. Draw a resample: pick n values from it with replacement.
3. Compute your statistic (mean, median, whatever) on that resample and record it.
4. Repeat steps 2-3 a large number of times; call it B (often 2,000-10,000).
5. The recorded values form the bootstrap distribution. Its standard deviation estimates the standard error; its 2.5th and 97.5th percentiles are a 95% confidence interval.

That's it. The same five steps work for any statistic — you only change the function in step 3.
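The five steps can be sketched as one small function. This is a minimal illustration rather than a reference implementation: the name percentile_bootstrap, its parameters, and the synthetic exponential data are all assumptions made for this example.

```python
import numpy as np

def percentile_bootstrap(data, stat_fn, B=4000, level=0.95, rng=None):
    """Hypothetical helper: percentile bootstrap SE and CI for stat_fn."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)                                 # step 1: the observed sample
    n = len(data)
    boot = np.empty(B)
    for b in range(B):                                      # step 4: repeat B times
        resample = rng.choice(data, size=n, replace=True)   # step 2: resample
        boot[b] = stat_fn(resample)                         # step 3: compute and record
    alpha = 1.0 - level
    low, high = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    se = boot.std(ddof=1)                                   # step 5: SE and percentile CI
    return se, (float(low), float(high)), boot

# Usage on synthetic data (values are illustrative only).
rng = np.random.default_rng(0)
data = rng.exponential(scale=10.0, size=60)
se, ci, boot = percentile_bootstrap(data, np.median, rng=rng)
print(f"median={np.median(data):.2f}  SE={se:.2f}  95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

Passing np.mean or any other function of a 1-D array in place of np.median changes the statistic without touching the rest of the function.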

Bootstrapping a mean (and checking it)

Let's implement it from scratch with NumPy. rng.choice(data, size=len(data), replace=True) draws one resample. We loop B times, take the mean each time, and read off the percentiles. As a sanity check, we'll compare the bootstrap CI for the mean against the classic t-interval from Confidence Intervals — they should land in nearly the same place, which is reassuring evidence the bootstrap isn't magic, just simulation.
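A minimal sketch of that comparison follows. The data are synthetic (50 draws from a normal distribution), the seed is arbitrary, and the t critical value for 49 degrees of freedom is hardcoded so the example needs only NumPy; scipy.stats.t.ppf(0.975, 49) would return the same number.

```python
import numpy as np

rng = np.random.default_rng(42)                    # arbitrary seed
data = rng.normal(loc=50.0, scale=10.0, size=50)   # synthetic sample
n = len(data)
B = 5000

# Bootstrap: resample, take the mean, record it.
boot_means = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=n, replace=True)
    boot_means[b] = resample.mean()

boot_se = boot_means.std(ddof=1)
boot_ci = np.percentile(boot_means, [2.5, 97.5])

# Classic t-interval for comparison.
formula_se = data.std(ddof=1) / np.sqrt(n)
t_crit = 2.0096  # 97.5th percentile of t with n - 1 = 49 df (hardcoded)
t_ci = (data.mean() - t_crit * formula_se, data.mean() + t_crit * formula_se)

print(f"bootstrap SE = {boot_se:.3f}   formula SE = {formula_se:.3f}")
print(f"bootstrap CI = ({boot_ci[0]:.2f}, {boot_ci[1]:.2f})")
print(f"t-interval   = ({t_ci[0]:.2f}, {t_ci[1]:.2f})")
```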

The bootstrap standard error and the s / √n formula match closely, and the two intervals overlap almost exactly. For the mean, the bootstrap is just a more roundabout way to get the familiar answer — the payoff comes when there is no familiar answer.

Vectorize when B is large

The explicit loop is the clearest way to learn the bootstrap, and it's fast enough for these examples. When you need speed, draw all resamples at once: idx = rng.integers(0, n, size=(B, n)) then stats = data[idx].mean(axis=1). Here idx holds one row of n random indices per resample, so stats comes back with all B bootstrap means in a single call. Same idea, no Python loop. Start with the loop for understanding; reach for the vectorized form in production.
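Written out in full, the vectorized form might look like the sketch below. The exponential data, seed, and sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)                 # arbitrary seed
data = rng.exponential(scale=5.0, size=200)    # synthetic sample
n, B = len(data), 10_000

# One row of n random indices per resample; indexing with a 2-D integer
# array returns a (B, n) array of resampled values.
idx = rng.integers(0, n, size=(B, n))
boot_means = data[idx].mean(axis=1)            # shape (B,): one mean per resample

boot_se = boot_means.std(ddof=1)
ci = np.percentile(boot_means, [2.5, 97.5])
print(f"SE = {boot_se:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The trade-off is memory: idx and data[idx] each hold B × n entries, so for very large B or n it can make sense to process the resamples in chunks.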

The payoff: a CI for the median, which has no easy formula

The mean has a tidy standard error. The median does not — its formula is awkward and depends on the unknown density of the data at the median. With the bootstrap you don't care: you resample, take the median of each resample, and read off the percentiles. The procedure is identical to the mean case except for one word.
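Here is a sketch of the median version on skewed synthetic data (the lognormal sample and the seed are assumptions for this example). Compared with a bootstrap of the mean, only the call to np.median differs.

```python
import numpy as np

rng = np.random.default_rng(1)                        # arbitrary seed
data = rng.lognormal(mean=3.0, sigma=0.8, size=80)    # skewed synthetic sample
B = 4000

boot_medians = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=len(data), replace=True)
    boot_medians[b] = np.median(resample)             # the one word that changed

low, high = np.percentile(boot_medians, [2.5, 97.5])
ci = (float(low), float(high))

print(f"sample median = {np.median(data):.2f}")
print(f"95% CI        = ({ci[0]:.2f}, {ci[1]:.2f})")
# Far fewer distinct values than B: the bootstrap medians are "chunky".
print("distinct bootstrap medians:", np.unique(boot_medians).size)
```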

The histogram is the sampling distribution of the median, reconstructed from a single dataset. The red lines mark the 2.5th and 97.5th percentiles, which together form the 95% CI. You'll often notice the bootstrap distribution of the median looks chunky or stepped; that's a real property of the median on a finite sample (it can only land on actual data values or, when n is even, on the midpoint of two of them), and it's a hint about one of the bootstrap's limits, which we'll get to.

One engine, any statistic

The same five lines compute a CI for the mean, the median, a trimmed mean, the 90th percentile, a correlation, or a ratio — swap the function in the resample loop and nothing else changes. That generality is why the bootstrap is a data scientist's Swiss-army knife: when you can't find (or trust) a formula, you can almost always bootstrap.
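To make the "swap one function" point concrete, the sketch below runs the same loop over four statistics. The helper names trimmed_mean and p90, the 10% trimming proportion, and the lognormal data are assumptions for illustration. For statistics of paired data, such as a correlation, you would resample row indices so each (x, y) pair stays together.

```python
import numpy as np

def trimmed_mean(x, prop=0.1):
    """Mean after dropping the lowest and highest `prop` fraction of values."""
    x = np.sort(x)
    k = int(prop * len(x))
    return x[k:len(x) - k].mean()

def p90(x):
    """90th percentile."""
    return np.percentile(x, 90)

rng = np.random.default_rng(3)                        # arbitrary seed
data = rng.lognormal(mean=2.0, sigma=0.7, size=120)   # synthetic, right-skewed
B = 3000

results = {}
for name, stat_fn in [("mean", np.mean), ("median", np.median),
                      ("trimmed mean", trimmed_mean), ("90th percentile", p90)]:
    boot = np.empty(B)
    for b in range(B):
        resample = rng.choice(data, size=len(data), replace=True)
        boot[b] = stat_fn(resample)   # the only line that depends on the statistic
    low, high = np.percentile(boot, [2.5, 97.5])
    results[name] = (boot.std(ddof=1), (float(low), float(high)))
    print(f"{name}: SE={results[name][0]:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```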

When to use the bootstrap — and when to be careful

The bootstrap is general, but it is not magic. It estimates sampling variability by reusing the data you have, so it inherits that data's flaws and runs out of information when the sample is thin.

Reach for it when:

Your statistic has no clean standard-error formula (median, trimmed mean, ratio, correlation, percentile, Gini, etc.).
The assumptions behind a formula are shaky — skewed data, unknown shape — and you'd rather not trust a normal approximation.
You want a quick, assumption-light sanity check on a formula-based interval.

Be cautious, or avoid it altogether, when:

n is very small. With, say, 8 points, your sample is a poor stand-in for the population, so the bootstrap distribution is built on too little information. It can't conjure detail that isn't there.
The statistic depends on extremes, like the maximum or minimum. Resampling can never produce a value larger than the biggest observation, so the bootstrap badly misrepresents the distribution of a max. Heavy-tailed data hurts for the same reason.
The data aren't independent (time series, clustered/grouped data). Plain resampling scrambles the structure; you'd need a specialized variant (block bootstrap), which is beyond this page.
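The failure for extremes is easy to demonstrate. In the sketch below (synthetic uniform data, arbitrary seed), no bootstrap maximum can exceed the observed maximum, and a large share of resamples reproduce it exactly: a resample of size n contains the largest observation with probability 1 - (1 - 1/n)^n, which is about 0.63 for moderate n.

```python
import numpy as np

rng = np.random.default_rng(5)              # arbitrary seed
data = rng.uniform(0.0, 1.0, size=100)      # synthetic sample
n = len(data)
B = 4000

boot_max = np.array([
    rng.choice(data, size=n, replace=True).max() for _ in range(B)
])

# A resample can never contain a value above the observed maximum.
print(boot_max.max() <= data.max())         # True

# Share of resamples whose max equals the sample max exactly.
frac_at_max = np.mean(boot_max == data.max())
print(frac_at_max, 1 - (1 - 1 / n) ** n)    # both close to 0.63
```

The bootstrap distribution of the maximum is therefore a spike at the observed maximum plus a few values below it, which says little about how far the true population maximum might lie above the sample.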

The bootstrap cannot create information

The single most important caveat: the bootstrap does not add new data and does not fix a bad sample. If your sample is biased — collected unfairly, or too small to represent the population — every resample carries that same bias, and the bootstrap will hand you a confident interval around the wrong center. Resampling estimates precision (how much the statistic wobbles), not accuracy (whether you're aimed at the truth). Garbage in, confidently-quantified garbage out.
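A small simulation makes the point. Everything here is an assumption for illustration: a synthetic exponential "population" of session lengths, and a sample of 40 drawn only from the top 20% of it.

```python
import numpy as np

rng = np.random.default_rng(21)                         # arbitrary seed
population = rng.exponential(scale=10.0, size=100_000)  # synthetic population
true_mean = population.mean()

# Biased sampling: only the heaviest 20% can end up in the sample.
cutoff = np.quantile(population, 0.80)
heavy_users = population[population >= cutoff]
sample = rng.choice(heavy_users, size=40, replace=False)

B = 4000
boot_means = np.empty(B)
for b in range(B):
    boot_means[b] = rng.choice(sample, size=len(sample), replace=True).mean()
ci = np.percentile(boot_means, [2.5, 97.5])

print(f"true population mean = {true_mean:.2f}")
print(f"bootstrap 95% CI     = ({ci[0]:.2f}, {ci[1]:.2f})")
print("CI contains the truth:", ci[0] <= true_mean <= ci[1])  # False
```

The interval is narrow and well behaved, yet it sits entirely above the true mean, because every resample is built from the same biased values.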

Practice the bootstrap

A single data sample has been created for you. Build a percentile bootstrap 95% confidence interval for the median and store it as a tuple ci = (low, high).

Steps:

Use the provided rng and B = 4000 resamples.
Each iteration: resample = rng.choice(data, size=len(data), replace=True), then record np.median(resample).
ci = the 2.5th and 97.5th percentiles of the recorded medians, as a tuple of two floats with ci[0] < ci[1].

Use np.percentile(boot, [2.5, 97.5]).

Using the provided data and rng, build the bootstrap distribution of the mean with B = 4000 resamples, then return the bootstrap standard error — the standard deviation of the bootstrap means — in a float variable named boot_se.

Each iteration resamples len(data) values with replacement and records the mean.
boot_se = float(np.std(boot_means, ddof=1)).

The bootstrap SE should land close to the textbook data.std(ddof=1) / sqrt(len(data)).

More resamples reduce simulation noise (not sampling noise)

B, the number of resamples, controls only the Monte Carlo noise of the bootstrap itself: a bigger B makes your estimated SE and CI stable across reruns, but it does not shrink the underlying sampling uncertainty, which is set by your original sample size n. Increasing B from 1,000 to 100,000 smooths the answer; only collecting more real data (a larger n) actually sharpens it. A few thousand resamples is usually plenty.
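One way to see the two kinds of noise separately is to rerun the bootstrap several times at different values of B on the same data. The sketch below uses synthetic normal data and a hypothetical helper boot_se_of_mean; the numbers it prints will vary with the seed.

```python
import numpy as np

def boot_se_of_mean(data, B, rng):
    """Hypothetical helper: vectorized bootstrap SE of the mean."""
    n = len(data)
    idx = rng.integers(0, n, size=(B, n))
    return data[idx].mean(axis=1).std(ddof=1)

rng = np.random.default_rng(11)                    # arbitrary seed
data = rng.normal(loc=0.0, scale=1.0, size=40)     # synthetic sample, n = 40
formula_se = data.std(ddof=1) / np.sqrt(len(data))

center, spread = {}, {}
for B in (100, 1000, 10_000):
    reruns = [boot_se_of_mean(data, B, rng) for _ in range(20)]
    center[B] = np.mean(reruns)   # where the SE estimate sits
    spread[B] = np.std(reruns)    # how much it jumps between reruns
    print(f"B={B:>6}: mean SE={center[B]:.4f}, rerun-to-rerun SD={spread[B]:.4f}")

print(f"formula SE = {formula_se:.4f}")
```

The rerun-to-rerun spread shrinks as B grows, while the SE itself stays near s / √n for every B; only a larger n would make that SE smaller.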

Check your understanding

Question (select one)

What is the defining mechanic of the bootstrap?

Drawing many fresh samples directly from the population

Repeatedly resampling your observed data with replacement and recomputing the statistic on each resample

Fitting a normal distribution to the data and reading off percentiles

Removing one observation at a time and refitting

Question (select one)

Why must bootstrap resampling be done with replacement (at the original sample size)?

Because sampling without replacement would be too slow to compute

Without replacement at size n you'd just reorder the same values and get the identical statistic every time, producing zero variability

Because replacement guarantees each resample contains every original value

Because it makes the resamples larger than the original sample

Question (select one)

The bootstrap is especially valuable for a statistic like the median mainly because:

The median is impossible to compute without resampling

The median has no simple standard-error formula, yet the bootstrap gives it a CI with the same procedure used for any other statistic

The median is always normally distributed, so the formula is easy

The bootstrap makes the median more accurate than the mean

Question (select one)

For which statistic is the ordinary bootstrap least trustworthy?

The sample mean of 500 observations

The median of 200 observations

The maximum of the sample

A trimmed mean of 300 observations

Question (select one)

Your sample of 40 sessions was accidentally drawn only from power users, so it's biased toward heavy usage. You bootstrap a 95% CI for mean session length. What does that interval tell you?

The interval corrects for the sampling bias and recovers the true population mean

The interval is invalid and cannot be computed

It quantifies the precision of the mean for this biased sample, but it's centered on the wrong value — the bootstrap estimates wobble, not whether you're aimed at the truth

A larger number of resamples B would remove the bias

Key takeaways

The bootstrap resamples your data with replacement, treating the sample as a stand-in for the population, to approximate a statistic's sampling distribution — no formula, no normality assumption.
The standard deviation of the bootstrap distribution estimates the standard error; its 2.5th and 97.5th percentiles give a 95% percentile confidence interval.
It works for any statistic — median, trimmed mean, ratio, correlation, percentile — by swapping one function in the resample loop.
It struggles with very small n, extreme-based statistics (max/min), heavy tails, and dependent data.
It estimates precision, not accuracy: it does not create new information or fix a biased sample. B controls only simulation noise; only more real data improves the estimate.

You now have two complementary routes to a confidence interval: the formula-based intervals of Confidence Intervals when assumptions hold, and the assumption-light bootstrap when they don't. Both rest on the same foundation — the Sampling Distributions and Standard Error ideas that describe how a statistic varies from sample to sample.
