Confidence Intervals
What a confidence interval really is — a point estimate plus a margin of error that expresses precision — and the one interpretation almost everyone gets wrong, built from a coverage simulation you can run yourself.
A single number is a confident-sounding lie. When you report "average revenue per user is $48.20," you've collapsed a noisy estimate into a point that looks exact — but a different sample would have given $46.10 or $50.05. In Populations and Samples we named that wobble estimation error, and in Standard Error we measured its size. A confidence interval is what you report instead of the bare point: a range that carries the uncertainty with it, so your reader knows how precise the estimate actually is.
The shape is always the same — point estimate ± margin of error — and it answers a specific question: given the noise in my sample, what range of values for the true parameter is plausible? This page builds that intuition, then spends most of its energy on the single most misunderstood sentence in all of applied statistics: what "95% confidence" actually means.
A point estimate alone hides its own uncertainty
You sample 50 users, the mean is $48.20. That's your point estimate — your single best guess for the population mean μ. But you already know x̄ is a random variable: it has a standard error, a typical sample-to-sample wobble. The point estimate throws that information away. The confidence interval keeps it.
Read the diagram left to right: take your estimate, attach a margin of error built from the standard error, and you get a band of plausible values for the parameter. A wide band says "I'm not sure"; a narrow band says "I've pinned this down." The interval turns a single number into an honest number.
The anatomy of every confidence interval
Every CI you will ever build has three ingredients: a point estimate (where you think the parameter is), a standard error (how much your estimate wobbles), and a critical value (how many standard errors wide to make the band, set by the confidence level). Combine them: estimate ± (critical value × standard error). Memorize the shape, not any single formula; the shape is universal.
Building a CI for a mean
For a population mean, the point estimate is x̄ and the standard
error is s / √n (sample standard deviation over root-n). Because
we estimate σ from the data, the right critical value comes from
the t-distribution with n − 1 degrees of freedom, not the normal —
the t is slightly wider to pay for that extra uncertainty, and the
difference matters most when n is small.
scipy.stats does the arithmetic for you. stats.t.interval takes the
confidence level, the degrees of freedom, the center, and the standard
error, and hands back the two endpoints.
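A minimal sketch of that call, assuming a made-up sample of 50 per-user revenue values (the seed, the mean of 48, and the spread of 12 are illustrative assumptions, not data from this page):

```python
import numpy as np
from scipy import stats

# Hypothetical data: 50 per-user revenue values (illustrative only).
rng = np.random.default_rng(0)
sample = rng.normal(loc=48.0, scale=12.0, size=50)

n = sample.size
x_bar = sample.mean()          # point estimate
s = sample.std(ddof=1)         # sample standard deviation (n - 1 in the denominator)
se = s / np.sqrt(n)            # standard error of the mean

# Arguments: confidence level, degrees of freedom, center, standard error.
low, high = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)
margin = (high - low) / 2      # half the width, i.e. the "±" part

print(f"mean = {x_bar:.2f}")
print(f"95% CI = ({low:.2f}, {high:.2f}), margin of error = {margin:.2f}")
```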
The margin of error is half the width — the ± part. Notice it is
built entirely from things you can compute from the sample (s, n)
and the confidence level (the critical value). If you want the critical
value explicitly, it's stats.t.ppf(0.975, df=n-1) for a 95% interval
(0.975 because 2.5% is left in each tail), and the CI is x_bar ± t_crit * se. The two approaches give the identical interval.
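To check that claim, here is a short sketch that builds the interval both ways on the same hypothetical sample (the data are simulated purely for illustration) and compares the endpoints:

```python
import numpy as np
from scipy import stats

# Hypothetical data, simulated only to have something to compute on.
rng = np.random.default_rng(1)
sample = rng.normal(loc=50.0, scale=10.0, size=30)

n = sample.size
x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)

# Route 1: explicit critical value, then x_bar ± t_crit * se.
t_crit = stats.t.ppf(0.975, df=n - 1)
manual = (x_bar - t_crit * se, x_bar + t_crit * se)

# Route 2: let SciPy assemble the same pieces.
builtin = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)

print(f"critical value: {t_crit:.3f}")
print("manual :", manual)
print("builtin:", builtin)
print("identical:", np.allclose(manual, builtin))
```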
Why t and not z (normal)?
You use the t-distribution whenever you estimate the standard
deviation from the same sample — which is essentially always in
practice. The t has heavier tails to account for that extra guesswork,
so its intervals are a touch wider, especially for small n. By the
time n is in the hundreds, t and normal are nearly identical, so
reaching for stats.t.interval is a safe default whenever σ is estimated from the sample.
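One way to see how quickly the gap closes is to compare the two critical values directly. The sample sizes below are arbitrary choices for illustration:

```python
from scipy import stats

# Critical value for a 95% interval under the normal distribution.
z_crit = stats.norm.ppf(0.975)

rows = []
for n in (5, 10, 30, 100, 1000):
    # Same tail probability, but with n - 1 degrees of freedom.
    t_crit = stats.t.ppf(0.975, df=n - 1)
    rows.append((n, t_crit, t_crit / z_crit))

print(f"z critical value: {z_crit:.3f}")
for n, t_crit, ratio in rows:
    print(f"n = {n:>4}: t = {t_crit:.3f}  (t / z = {ratio:.3f})")
```

The ratio column is how much wider the t-interval is than a z-interval built from the same standard error.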
What "95% confidence" really means
Here is the sentence almost everyone gets wrong. A 95% confidence level is a property of the procedure, not of any single interval you computed.
If you repeated the entire process — draw a fresh sample, build a 95% interval from it — over and over, about 95% of those intervals would contain the true parameter.
The confidence is in the recipe's long-run hit rate, not in your one particular interval. Your specific interval either contains μ or it doesn't — you just don't know which, because μ is fixed and invisible. The randomness lives in the interval (it jumps around with your sample), not in the parameter (it sits still).
The diagram is the whole idea: μ never moves, but each sample produces a different interval. Most cover the truth; an unlucky ~5% miss. "95% confidence" is a promise about that picture — about the collection of intervals the procedure would generate — not a probability statement about the one interval in your notebook.
The #1 misinterpretation (read this twice)
"There is a 95% probability that the true mean lies in this interval (48.2, 53.9)" is wrong under the standard (frequentist) reading. Once the interval is computed, the true mean is either in it or not — there's no probability left to assign, because nothing is random anymore. The 95% describes how often the method succeeds across many hypothetical samples, not the chance for your single realized interval. The parameter is fixed; the interval was the random thing, and you've already rolled the dice.
The canonical simulation: watch the coverage happen
Talk is cheap; let's see it. We'll play god: define a population with a known μ, then draw 100 separate samples, build a 95% CI from each, and color the ones that miss μ. About 5 of the 100 should fail — that's the 95% coverage made visible. This is the single most important picture on the page.
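A minimal, text-only sketch of that simulation follows. The population (normal with μ = 50, σ = 10), the sample size of 30, and the seed are all assumptions chosen for illustration; each interval is labeled "green" if it covers μ and "red" if it misses, and a plotting library could draw the bars from the same list.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
mu, sigma = 50.0, 10.0       # the "known" population parameters (we play god)
n, n_sims = 30, 100          # sample size and number of repeated samples

intervals = []
for _ in range(n_sims):
    sample = rng.normal(mu, sigma, size=n)
    se = sample.std(ddof=1) / np.sqrt(n)
    low, high = stats.t.interval(0.95, df=n - 1, loc=sample.mean(), scale=se)
    covers = low <= mu <= high
    intervals.append((low, high, "green" if covers else "red"))

misses = sum(color == "red" for _, _, color in intervals)
print(f"{misses} of {n_sims} intervals missed mu = {mu}")

# Show the first few intervals with their labels.
for low, high, color in intervals[:10]:
    print(f"({low:6.2f}, {high:6.2f})  {color}")
```

Changing the seed changes which intervals miss and the exact count, but over many runs the miss rate should settle near 5%.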
Every red interval was built by the exact same correct procedure as every green one — it just drew an unlucky sample whose mean landed far from μ. You can't look at one interval and know its color; that's precisely why you can't say "this interval has a 95% chance." The 95% is the fraction of green bars across the whole forest.
Say it the right way
Correct: "We are 95% confident the true mean is between 48 and 54," understood as the procedure that produced this range captures the truth 95% of the time. Also fine in plain English: "our best estimate is 51, give or take about 3." Avoid: "there's a 95% probability the mean is in this interval" — it sounds identical but means something the frequentist framework can't deliver.
You compute a 95% CI for average order value and get ($31, $37). Which statement is the technically correct interpretation?
There is a 95% probability that the true average order value is between $31 and $37
95% of all orders have a value between $31 and $37
If we repeated this sampling-and-interval procedure many times, about 95% of the resulting intervals would contain the true mean
We are 95% sure the sample mean is between $31 and $37
What a CI is not — four misreadings to kill
The interpretation trap has several flavors. Name each one so you can catch yourself.
- "There's a 95% probability μ is in this interval." Covered above — the parameter is fixed, so it's already in or out. The probability lived in the random interval, which is now decided.
- "95% of the data falls inside the interval." No. A CI for the mean is built from the standard error (s / √n), which shrinks as n grows. The range that holds ~95% of individual values is roughly x̄ ± 2s (using the standard deviation) and does not shrink with n. Confusing these two is rampant: a 95% CI for the mean of a big sample can be razor-thin while the data themselves are spread all over.
- "A wider interval is better/safer." Wider means less precise, not more trustworthy. You can always get a wider interval by raising the confidence level toward 100%, but a 99.99% interval so vague it spans "somewhere between $5 and $95" tells you nothing. Precision (narrow) and confidence (high) trade off; the art is balancing them.
- "The CI includes 0, so there's definitely no effect." A CI that includes 0 (for a difference) or 1 (for a ratio) means you can't rule out no effect — it's consistent with zero, but also consistent with a meaningful effect at the interval's far end. "Can't rule out" is not "proven absent." We'll connect this directly to Hypothesis Testing.
CI for the MEAN vs. range of the DATA
This is worth isolating because it bites experienced analysts. The
confidence interval for the mean uses the standard error
s / √n and gets narrower with more data — it's about pinning
down μ. The spread of the data uses the standard deviation
s and does not shrink with n — it's about how individual values
scatter. A CI of (49.8, 50.2) on 100,000 points does not mean the
data lives in that sliver; it means you know the average very
precisely. Always ask: "a range for the parameter, or for the values?"
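The contrast is easy to check numerically. This sketch assumes a hypothetical population with mean 50 and standard deviation 30, sampled 100,000 times, and counts how many individual values land inside each range:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 100,000 values with a wide spread (sd = 30).
rng = np.random.default_rng(7)
data = rng.normal(loc=50.0, scale=30.0, size=100_000)

n = data.size
x_bar = data.mean()
s = data.std(ddof=1)
se = s / np.sqrt(n)

# Range for the PARAMETER: built from the standard error.
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)

# Rough range for the VALUES: built from the standard deviation.
data_low, data_high = x_bar - 2 * s, x_bar + 2 * s

inside_ci = np.mean((data >= ci_low) & (data <= ci_high))
inside_2s = np.mean((data >= data_low) & (data <= data_high))

print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
print(f"x_bar ± 2s:          ({data_low:.2f}, {data_high:.2f})")
print(f"share of data inside the CI:      {inside_ci:.3%}")
print(f"share of data inside x_bar ± 2s:  {inside_2s:.3%}")
```

The x̄ ± 2s rule of thumb holds roughly 95% of values here because the simulated data are normal; for skewed data that share can differ.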
What controls the width
The margin of error is critical value × standard error, so the width
responds to exactly three levers:
- Confidence level ↑ → wider. Demanding 99% coverage instead of 95% needs a bigger critical value, so the band grows. More confidence costs precision.
- Sample size n ↑ → narrower. The standard error is s / √n, so width shrinks like 1 / √n: quadruple the data to halve the margin. (The same diminishing return from Standard Error.)
- Variability s ↑ → wider. Noisier data gives a noisier estimate. You don't control this directly, but cleaner measurement helps.
Let's watch the first two levers move.
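A minimal sketch of both levers, assuming a hypothetical normal population (mean 50, sd 10). The helper `t_width` is defined here for illustration and is not part of SciPy. The first block varies the confidence level on one fixed sample; the second holds the level at 95% and varies the sample size.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def t_width(sample, level):
    """Width (high - low) of the t-interval for the mean of `sample`."""
    n = sample.size
    se = sample.std(ddof=1) / np.sqrt(n)
    low, high = stats.t.interval(level, df=n - 1, loc=sample.mean(), scale=se)
    return high - low

# Block 1: same sample, different confidence levels.
sample = rng.normal(loc=50.0, scale=10.0, size=100)
level_widths = {level: t_width(sample, level) for level in (0.80, 0.90, 0.95, 0.99)}
for level, width in level_widths.items():
    print(f"confidence {level:.0%}: width = {width:.2f}")

# Block 2: same confidence level, different sample sizes.
size_widths = {}
for n in (25, 100, 400, 1600):
    size_widths[n] = t_width(rng.normal(loc=50.0, scale=10.0, size=n), 0.95)
    print(f"n = {n:>4}: width = {size_widths[n]:.2f}")
```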
The two blocks make the trade-off concrete: you buy precision (a
narrower interval) either by accepting less confidence or by collecting
more data — and the √n rule means the second option gets
expensive fast.
Your 95% CI for a conversion-rate lift is too wide to be useful. Which change will narrow it without lowering your confidence level?
Switch from a 95% to a 99% confidence level
Collect a substantially larger sample
Report the interval to more decimal places
Remove the most extreme data points to reduce the spread
A CI for a proportion
Means aren't the only thing you estimate. Conversion rates, churn
rates, click-through rates, "yes" shares in a survey — these are all
proportions, and they get confidence intervals too. The point
estimate is p̂ (successes over trials), and the standard error
of a proportion is √( p̂(1 − p̂) / n ). The classic
"normal-approximation" (Wald) interval is p̂ ± z × √( p̂(1 − p̂) / n ).
The Wald interval is shaky near 0%, near 100%, or for small n
The normal-approximation formula above is fine for a healthy sample
away from the extremes, but it behaves badly when p̂ is close to
0 or 1, or when n is small (it can even produce a lower bound below 0).
For those cases use a better interval — the Wilson or
Clopper-Pearson methods. In SciPy, stats.binomtest(successes, n).proportion_ci()
gives a robust interval (Clopper-Pearson by default, Wilson on request).
When in doubt, prefer those over the hand-rolled Wald formula.
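To make the comparison concrete, this sketch computes all three intervals for two hypothetical results: a healthy sample (120 successes in 1,000 trials) and a small one near 0% (1 success in 20 trials). The counts are invented for illustration, and `wald_ci` is a helper written here, not a SciPy function.

```python
import numpy as np
from scipy import stats

def wald_ci(successes, n, level=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    z = stats.norm.ppf(0.5 + level / 2)    # 0.975 quantile for a 95% interval
    return p_hat - z * se, p_hat + z * se

# Hypothetical counts: (successes, trials).
for successes, n in [(120, 1000), (1, 20)]:
    result = stats.binomtest(successes, n)
    exact = result.proportion_ci(confidence_level=0.95)                     # Clopper-Pearson (default)
    wilson = result.proportion_ci(confidence_level=0.95, method="wilson")   # Wilson score
    w_low, w_high = wald_ci(successes, n)

    print(f"{successes}/{n}:")
    print(f"  Wald            ({w_low:.4f}, {w_high:.4f})")
    print(f"  Clopper-Pearson ({exact.low:.4f}, {exact.high:.4f})")
    print(f"  Wilson          ({wilson.low:.4f}, {wilson.high:.4f})")
```

For the 1-in-20 case the Wald lower bound falls below zero, which is impossible for a proportion, while the other two methods stay inside [0, 1].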
Practice building intervals
A sample of sensor readings has been created for you. Compute a 95% confidence interval for the population mean using the t-distribution, and store the two endpoints in a tuple named ci as (low, high).
Steps:
- Point estimate: the sample mean.
- Standard error: sample sd with ddof=1, divided by sqrt(n).
- Use scipy.stats.t.interval(0.95, df=n-1, loc=mean, scale=se).
ci must be a tuple of two floats with ci[0] < ci[1].
An email campaign got successes opens out of n sends (both given). Build a 95% normal-approximation (Wald) confidence interval for the true open-rate proportion and store it as a tuple ci = (low, high).
- Point estimate p_hat = successes / n.
- Standard error sqrt(p_hat * (1 - p_hat) / n).
- Critical value stats.norm.ppf(0.975) (about 1.96).
ci = (p_hat - z*se, p_hat + z*se), both endpoints as floats.
Show the precision payoff of more data. For each sample size in sizes = [25, 100, 400], a sample is drawn for you from the same population. Compute the width (high minus low) of the 95% t-interval for the mean at each size, and store them in order in a list named widths (three floats).
Width = high - low from stats.t.interval(0.95, df=n-1, loc=mean, scale=se), with se = sd(ddof=1)/sqrt(n).
Because width scales like 1/sqrt(n), going 25 -> 100 (x4) should roughly halve it, and 100 -> 400 (x4) should roughly halve it again.
CIs and hypothesis tests are two views of one thing
A confidence interval and a hypothesis test are deeply linked — they're the same information in two outfits. A 95% CI for a difference contains exactly the values you would not reject at the 5% level. So a quick test: does the interval include the no-effect value?
- A 95% CI for a difference that excludes 0 corresponds to a statistically significant result (p < 0.05) — you'd reject "no difference."
- A 95% CI that includes 0 means you cannot rule out zero at the 5% level — but, crucially, that's "not proven different," not "proven the same." The interval also includes nonzero values you can't rule out either.
This duality is why many statisticians prefer reporting a CI over a bare p-value: the interval tells you the direction, the plausible magnitude, and the significance all at once. We develop the test side fully in Hypothesis Testing and p-values.
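The duality can be checked by simulation. This sketch assumes hypothetical per-unit differences (normal, true mean 0.3, sd 1, 40 observations per experiment) and, for each simulated experiment, compares "the 95% t-interval excludes 0" against "the one-sample t-test gives p < 0.05":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
trials = 200
agree = 0
n_significant = 0

for _ in range(trials):
    # Hypothetical differences (e.g. treatment minus control), illustrative only.
    diffs = rng.normal(loc=0.3, scale=1.0, size=40)
    n = diffs.size
    se = diffs.std(ddof=1) / np.sqrt(n)

    low, high = stats.t.interval(0.95, df=n - 1, loc=diffs.mean(), scale=se)
    p_value = stats.ttest_1samp(diffs, popmean=0.0).pvalue

    excludes_zero = not (low <= 0.0 <= high)
    significant = p_value < 0.05
    agree += int(excludes_zero == significant)
    n_significant += int(significant)

print(f"CI verdict matched the test verdict in {agree} of {trials} experiments")
print(f"significant results: {n_significant} of {trials}")
```

The two verdicts match because the interval and the two-sided t-test are built from the same estimate, standard error, and t-distribution.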
'The CI includes 0' is not 'no effect'
A difference CI of (-0.5, 8.0) includes 0, so you can't declare a significant effect — but it also includes +8, a potentially large effect you equally can't rule out. The honest reading is "the data are consistent with anything from a small decrease to a sizable increase; we need more data." Absence of evidence (a wide interval straddling 0) is not evidence of absence (a tight interval hugging 0).
Check your understanding
A pollster reports "52% support, 95% CI 49% to 55%." A reader says "so there's a 95% chance the true support is between 49% and 55%." Why is this not the technically correct frequentist interpretation?
Because the sample was too small to make any probability claim
Because the true support is a fixed number that is either inside or outside this particular interval; the 95% describes the long-run hit rate of the procedure, not the chance for this one interval
Because confidence intervals only apply to means, never to proportions
Because the interval should have been centered on 50%, not 52%
You build a 95% CI for the mean weekly spend from 50,000 customers and get ($61.40, $61.80), a very narrow interval. A colleague concludes "so 95% of customers spend between $61.40 and $61.80." What's wrong?
Nothing is wrong; that's exactly what the interval says
The interval is too narrow to be trustworthy
The CI bounds the mean, using the standard error (s / √n); individual spending varies far more and is described by the standard deviation, not this interval
The colleague should have used a 99% interval instead
Why does a 99% confidence interval come out wider than a 95% one built from the same sample?
Because a 99% interval uses a smaller standard error
Because catching the parameter a larger fraction of the time requires a larger critical value, which multiplies the same standard error into a wider band
Because 99% intervals automatically use a larger sample
Because the true mean moves further away at 99% confidence
In the coverage simulation, 100 separate 95% CIs were built from a known population and about 5 missed the true μ. What do the ~5 misses represent?
A bug in the interval formula that should be fixed
Samples where the population mean happened to change
Unlucky samples whose intervals failed to capture the fixed μ: exactly the ~5% the 95% confidence level allows
Proof that 95% confidence intervals are unreliable
Which scenario calls for the t-distribution rather than the normal (z) when building a CI for a mean?
Only when the sample size is above 1,000
When the population standard deviation is unknown and estimated from the sample — which is essentially always in practice
Only when the data are not normally distributed
Never; the normal distribution is always correct for means
An A/B test gives a 95% CI for the lift in conversion of . Your PM asks: "Did the treatment work?" What's the most accurate answer?
Yes — the interval reaches +3.1%, so the treatment increased conversion
No — the treatment had no effect
We can't conclude a significant effect: the interval includes 0, so the data are consistent with anything from a slight decrease to a sizable increase — we likely need more data
The test is invalid because the lower bound is negative
Which action genuinely narrows a confidence interval for a mean while keeping its confidence level and validity intact?
Lowering the confidence level from 95% to 90%
Dropping the largest and smallest observations to shrink the spread
Increasing the sample size
Reporting the bounds with fewer significant figures
Key takeaways
- A confidence interval is point estimate ± margin of error, where margin = critical value × standard error. It expresses how precise your estimate is.
- "95% confidence" is a property of the procedure: repeat the sampling-and-interval process and ~95% of the intervals cover the true parameter. It is not "a 95% probability the parameter is in this one interval" — the parameter is fixed, the interval was random.
- A CI for the mean (uses standard error, shrinks with n) is not a range for the data (uses standard deviation, doesn't shrink).
- Width grows with the confidence level and the variability, and shrinks like 1 / √n. Wider is less precise, not "safer."
- Use stats.t.interval for means (t-distribution, df = n − 1) and a proportion formula (or stats.binomtest(...).proportion_ci()) for rates.
- A difference CI that includes 0 means inconclusive, not no effect; this is the bridge to Hypothesis Testing.
When a parameter has no clean standard-error formula — a median, a trimmed mean, a ratio, a correlation — you can still get a confidence interval by resampling your data. That remarkably general tool is next, in The Bootstrap.