Confidence Intervals

What a confidence interval really is — a point estimate plus a margin of error that expresses precision — and the one interpretation almost everyone gets wrong, built from a coverage simulation you can run yourself.

A single number is a confident-sounding lie. When you report "average revenue per user is $48.20," you've collapsed a noisy estimate into a point that looks exact — but a different sample would have given $46.10 or $50.05. In Populations and Samples we named that wobble estimation error, and in Standard Error we measured its size. A confidence interval is what you report instead of the bare point: a range that carries the uncertainty with it, so your reader knows how precise the estimate actually is.

The shape is always the same — point estimate ± margin of error — and it answers a specific question: given the noise in my sample, what range of values for the true parameter is plausible? This page builds that intuition, then spends most of its energy on the single most misunderstood sentence in all of applied statistics: what "95% confidence" actually means.

A point estimate alone hides its own uncertainty

You sample 50 users and the mean is $48.20. That's your point estimate, your single best guess for the population mean μ. But you already know x̄ is a random variable: it has a standard error, a typical sample-to-sample wobble. The point estimate throws that information away. The confidence interval keeps it.

Read the diagram left to right: take your estimate, attach a margin of error built from the standard error, and you get a band of plausible values for the parameter. A wide band says "I'm not sure"; a narrow band says "I've pinned this down." The interval turns a single number into an honest number.

The anatomy of every confidence interval

Every CI you will ever build has three ingredients: point estimate (where you think the parameter is), standard error (how much your estimate wobbles), and a critical value (how many standard errors wide to make the band, set by the confidence level). Combine them: estimate ± (critical value × standard error). Memorize the shape, not any single formula — the shape is universal.

Building a CI for a mean

For a population mean, the point estimate is x̄ and the standard error is s / √n (sample standard deviation over root-n). Because we estimate σ from the data, the right critical value comes from the t-distribution with n − 1 degrees of freedom, not the normal — the t is slightly wider to pay for that extra uncertainty, and the difference matters most when n is small.

scipy.stats does the arithmetic for you. stats.t.interval takes the confidence level, the degrees of freedom, the center, and the standard error, and hands back the two endpoints.
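As a minimal sketch of that call, the snippet below builds a 95% interval for a mean. The data are simulated and purely hypothetical (the `revenue` array, its seed, and its parameters are assumptions for illustration, not values from this page).

```python
import numpy as np
from scipy import stats

# Hypothetical data: 50 simulated revenue-per-user values (an assumption).
rng = np.random.default_rng(0)
revenue = rng.normal(loc=48.0, scale=12.0, size=50)

n = revenue.size
x_bar = revenue.mean()                  # point estimate
se = revenue.std(ddof=1) / np.sqrt(n)   # standard error: s / sqrt(n)

# Arguments: confidence level, degrees of freedom, center, standard error.
low, high = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)
print(f"mean = {x_bar:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The confidence level is passed positionally here because its keyword name has changed between SciPy versions.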

The margin of error is half the width — the ± part. Notice it is built entirely from things you can compute from the sample (s, n) and the confidence level (the critical value). If you want the critical value explicitly, it's stats.t.ppf(0.975, df=n-1) for a 95% interval (0.975 because 2.5% is left in each tail), and the CI is x_bar ± t_crit * se. The two approaches give the identical interval.
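To check that claim yourself, here is a small sketch that builds the interval by hand from the critical value and compares it with stats.t.interval. The sample is simulated and hypothetical; only the two SciPy calls come from the text above.

```python
import numpy as np
from scipy import stats

# Hypothetical sample (an assumption for illustration).
rng = np.random.default_rng(1)
sample = rng.normal(loc=50.0, scale=10.0, size=30)

n = sample.size
x_bar = sample.mean()
se = sample.std(ddof=1) / np.sqrt(n)

# Manual route: critical value times standard error gives the margin.
t_crit = stats.t.ppf(0.975, df=n - 1)   # 2.5% left in each tail
margin = t_crit * se
manual = (x_bar - margin, x_bar + margin)

# Built-in route.
builtin = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)

print("margin of error:", round(margin, 3))
print("same interval:", np.allclose(manual, builtin))
```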

Why t and not z (normal)?

You use the t-distribution whenever you estimate the standard deviation from the same sample, which is essentially always in practice. The t has heavier tails to account for that extra guesswork, so its intervals are a touch wider, especially for small n. By the time n is in the hundreds, t and normal are nearly identical, so reaching for stats.t.interval is a safe default: it costs almost nothing at large n and protects you at small n. It still assumes the sample mean is roughly normally distributed, so very small samples from heavily skewed data need more care.
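One way to see how quickly the gap closes is to compare the 95% critical values directly. This is a small sketch; the particular sample sizes are arbitrary choices for illustration.

```python
from scipy import stats

# Normal (z) critical value for a two-sided 95% interval.
z_crit = stats.norm.ppf(0.975)

# t critical values for a few illustrative sample sizes (df = n - 1).
t_crits = {n: stats.t.ppf(0.975, df=n - 1) for n in (5, 10, 30, 100, 1000)}

print(f"z: {z_crit:.3f}")
for n, t_crit in t_crits.items():
    print(f"n = {n:>4}: t = {t_crit:.3f}  (ratio to z: {t_crit / z_crit:.3f})")
```

The t value is always the larger of the two, and the ratio drifts toward 1 as n grows.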

What "95% confidence" really means

Here is the sentence almost everyone gets wrong. A 95% confidence level is a property of the procedure, not of any single interval you computed.

If you repeated the entire process — draw a fresh sample, build a 95% interval from it — over and over, about 95% of those intervals would contain the true parameter.

The confidence is in the recipe's long-run hit rate, not in your one particular interval. Your specific interval either contains μ or it doesn't — you just don't know which, because μ is fixed and invisible. The randomness lives in the interval (it jumps around with your sample), not in the parameter (it sits still).

The diagram is the whole idea: μ never moves, but each sample produces a different interval. Most cover the truth; an unlucky ~5% miss. "95% confidence" is a promise about that picture — about the collection of intervals the procedure would generate — not a probability statement about the one interval in your notebook.

The #1 misinterpretation (read this twice)

"There is a 95% probability that the true mean lies in this interval (48.2, 53.9)" is wrong under the standard (frequentist) reading. Once the interval is computed, the true mean is either in it or not — there's no probability left to assign, because nothing is random anymore. The 95% describes how often the method succeeds across many hypothetical samples, not the chance for your single realized interval. The parameter is fixed; the interval was the random thing, and you've already rolled the dice.

The canonical simulation: watch the coverage happen

Talk is cheap; let's see it. We'll play god: define a population with a known μ, then draw 100 separate samples, build a 95% CI from each, and color the ones that miss μ. About 5 of the 100 should fail — that's the 95% coverage made visible. This is the single most important picture on the page.
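A minimal sketch of that experiment follows. It counts misses rather than drawing the colored plot, and the population (normal, μ = 50, σ = 10), the sample size of 30, and the seeds are all assumptions chosen for illustration. A second, much longer run is included because 100 intervals is a small experiment and the miss count will wander from run to run.

```python
import numpy as np
from scipy import stats

def covered_flags(n_intervals, n=30, mu=50.0, sigma=10.0, level=0.95, seed=0):
    """Draw n_intervals samples of size n; flag which t-intervals contain mu."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(mu, sigma, size=(n_intervals, n))
    means = samples.mean(axis=1)
    ses = samples.std(axis=1, ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 1)
    low, high = means - t_crit * ses, means + t_crit * ses
    return (low <= mu) & (mu <= high)

# The experiment from the text: 100 intervals, count the misses.
flags_100 = covered_flags(100)
misses = int((~flags_100).sum())
print(f"{misses} of 100 intervals missed mu")

# A longer run shows the hit rate settling near the nominal 95%.
long_run_coverage = covered_flags(20_000, seed=1).mean()
print(f"long-run coverage: {long_run_coverage:.3f}")
```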

Every red interval was built by the exact same correct procedure as every green one — it just drew an unlucky sample whose mean landed far from μ. You can't look at one interval and know its color; that's precisely why you can't say "this interval has a 95% chance." The 95% is the fraction of green bars across the whole forest.

Say it the right way

Correct: "We are 95% confident the true mean is between 48 and 54," understood as shorthand for "the procedure that produced this range captures the truth 95% of the time." Also fine in plain English: "our best estimate is 51, give or take about 3." Avoid: "there's a 95% probability the mean is in this interval." It sounds identical but means something the frequentist framework can't deliver.

Question (select one)

You compute a 95% CI for average order value and get ($31, $37). Which statement is the technically correct interpretation?

There is a 95% probability that the true average order value is between $31 and $37

95% of all orders have a value between $31 and $37

If we repeated this sampling-and-interval procedure many times, about 95% of the resulting intervals would contain the true mean

We are 95% sure the sample mean is between $31 and $37

What a CI is not — four misreadings to kill

The interpretation trap has several flavors. Name each one so you can catch yourself.

"There's a 95% probability μ is in this interval." Covered above: the parameter is fixed, so it's already in or out. The probability lived in the random interval, which is now decided.
"95% of the data falls inside the interval." No. A CI for the mean is built from the standard error (s / √n), which shrinks as n grows. The range that holds ~95% of individual values is roughly x̄ ± 2s for bell-shaped data (using the standard deviation) and does not shrink with n. Confusing these two is rampant: a 95% CI for the mean of a big sample can be razor-thin while the data themselves are spread all over.
"A wider interval is better/safer." Wider means less precise, not more trustworthy. You can always get a wider interval by raising the confidence level toward 100% — but a 99.99% interval so vague it spans "somewhere between $5 and $95" tells you nothing. Precision (narrow) and confidence (high) trade off; the art is balancing them.
"The CI includes 0, so there's definitely no effect." A CI that includes 0 (for a difference) or 1 (for a ratio) means you can't rule out no effect — it's consistent with zero, but also consistent with a meaningful effect at the interval's far end. "Can't rule out" is not "proven absent." We'll connect this directly to Hypothesis Testing.

CI for the MEAN vs. range of the DATA

This is worth isolating because it bites experienced analysts. The confidence interval for the mean uses the standard error s / √n and gets narrower with more data — it's about pinning down μ. The spread of the data uses the standard deviation s and does not shrink with n — it's about how individual values scatter. A CI of (49.8, 50.2) on 100,000 points does not mean the data lives in that sliver; it means you know the average very precisely. Always ask: "a range for the parameter, or for the values?"
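The contrast is easy to check numerically. The sketch below uses simulated, hypothetical spending data (normal, with a mean of 50 and a standard deviation of 30, both assumptions) and compares the CI for the mean with the x̄ ± 2s range, counting how many individual values each one actually contains.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 100,000 simulated values (an assumption).
rng = np.random.default_rng(2)
spend = rng.normal(loc=50.0, scale=30.0, size=100_000)

n = spend.size
x_bar = spend.mean()
s = spend.std(ddof=1)
se = s / np.sqrt(n)

# Range for the PARAMETER: built from the standard error.
ci = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=se)

# Rough range for the VALUES: built from the standard deviation.
data_range = (x_bar - 2 * s, x_bar + 2 * s)

share_in_ci = np.mean((spend >= ci[0]) & (spend <= ci[1]))
share_in_range = np.mean((spend >= data_range[0]) & (spend <= data_range[1]))

print(f"CI for the mean: ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"share of values inside the CI:   {share_in_ci:.3%}")
print(f"share of values inside x̄ ± 2s:  {share_in_range:.3%}")
```

Only a sliver of the individual values falls inside the CI for the mean, while the ± 2s range holds roughly 95% of them.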

What controls the width

The margin of error is critical value × standard error, so the width responds to exactly three levers:

Confidence level ↑ → wider. Demanding 99% coverage instead of 95% needs a bigger critical value, so the band grows. More confidence costs precision.
Sample size n ↑ → narrower. The standard error is s / √n, so width shrinks like 1 / √n — quadruple the data to halve the margin. (The same diminishing return from Standard Error.)
Variability s ↑ → wider. Noisier data gives a noisier estimate. You don't control this directly, but cleaner measurement helps.

Let's watch the first two levers move.
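Here is one way to sketch both levers. To isolate each effect, the sample standard deviation is held fixed at an assumed value of 10, and `t_width` is a hypothetical helper written for this example rather than a SciPy function.

```python
import numpy as np
from scipy import stats

def t_width(level, n, s):
    """Full width of a t-interval for a mean: 2 * critical value * s / sqrt(n)."""
    t_crit = stats.t.ppf(0.5 + level / 2, df=n - 1)
    return 2 * t_crit * s / np.sqrt(n)

s = 10.0  # assumed sample standard deviation, held fixed for illustration

# Lever 1: raise the confidence level with n fixed at 100.
by_level = {level: t_width(level, n=100, s=s) for level in (0.90, 0.95, 0.99)}
for level, width in by_level.items():
    print(f"level = {level:.2f}: width = {width:.2f}")

# Lever 2: grow the sample with the level fixed at 95%.
by_n = {n: t_width(0.95, n=n, s=s) for n in (25, 100, 400)}
for n, width in by_n.items():
    print(f"n = {n:>3}: width = {width:.2f}")
```

In real samples s changes from draw to draw, so the halving with each fourfold increase in n is approximate rather than exact.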

Together, the two levers make the trade-off concrete: you buy precision (a narrower interval) either by accepting less confidence or by collecting more data, and the √n rule means the second option gets expensive fast.

Question (select one)

Your 95% CI for a conversion-rate lift is too wide to be useful. Which change will narrow it without lowering your confidence level?

Switch from a 95% to a 99% confidence level

Collect a substantially larger sample

Report the interval to more decimal places

Remove the most extreme data points to reduce the spread

A CI for a proportion

Means aren't the only thing you estimate. Conversion rates, churn rates, click-through rates, "yes" shares in a survey — these are all proportions, and they get confidence intervals too. The point estimate is p̂ (successes over trials), and the standard error of a proportion is √( p̂(1 − p̂) / n ). The classic "normal-approximation" (Wald) interval is p̂ ± z × √( p̂(1 − p̂) / n ).
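The Wald formula translates almost line for line into code. The counts below (230 successes out of 1,000 trials) are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical counts (an assumption): 230 conversions out of 1,000 visitors.
successes, n = 230, 1000

p_hat = successes / n                    # point estimate
se = np.sqrt(p_hat * (1 - p_hat) / n)    # standard error of a proportion
z = stats.norm.ppf(0.975)                # about 1.96 for a 95% interval

ci = (float(p_hat - z * se), float(p_hat + z * se))
print(f"p_hat = {p_hat:.3f}, 95% Wald CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```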

The Wald interval is shaky near 0%, near 100%, or for small n

The normal-approximation formula above is fine for a healthy sample away from the extremes, but it behaves badly when p̂ is close to 0 or 1, or when n is small (it can even produce a lower bound below 0). For those cases use a better interval — the Wilson or Clopper-Pearson methods. In SciPy, stats.binomtest(successes, n).proportion_ci() gives a robust interval (Clopper-Pearson by default, Wilson on request). When in doubt, prefer those over the hand-rolled Wald formula.
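A small, deliberately awkward case shows the difference. The counts (2 successes in 20 trials) are hypothetical; the sketch compares the hand-rolled Wald interval with the two SciPy methods named above, selected through the `method` argument of `proportion_ci`.

```python
import numpy as np
from scipy import stats

# Hypothetical small sample near 0% (an assumption): 2 successes in 20 trials.
successes, n = 2, 20

# Hand-rolled Wald interval.
p_hat = successes / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
z = stats.norm.ppf(0.975)
wald_low, wald_high = p_hat - z * se, p_hat + z * se

# SciPy intervals: "exact" is Clopper-Pearson (the default); "wilson" is Wilson.
result = stats.binomtest(successes, n)
exact = result.proportion_ci(confidence_level=0.95, method="exact")
wilson = result.proportion_ci(confidence_level=0.95, method="wilson")

print(f"Wald:            ({wald_low:.4f}, {wald_high:.4f})")
print(f"Clopper-Pearson: ({exact.low:.4f}, {exact.high:.4f})")
print(f"Wilson:          ({wilson.low:.4f}, {wilson.high:.4f})")
```

With these counts the Wald lower bound drops below 0, which is impossible for a proportion, while both SciPy intervals stay inside [0, 1].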

Practice building intervals

A sample of sensor readings has been created for you. Compute a 95% confidence interval for the population mean using the t-distribution, and store the two endpoints in a tuple named ci as (low, high).

Steps:

Point estimate: the sample mean.
Standard error: sample sd with ddof=1, divided by sqrt(n).
Use scipy.stats.t.interval(0.95, df=n-1, loc=mean, scale=se).

ci must be a tuple of two floats with ci[0] < ci[1].

An email campaign got successes opens out of n sends (both given). Build a 95% normal-approximation (Wald) confidence interval for the true open-rate proportion and store it as a tuple ci = (low, high).

Point estimate p_hat = successes / n.
Standard error sqrt(p_hat * (1 - p_hat) / n).
Critical value stats.norm.ppf(0.975) (about 1.96).
ci = (p_hat - z*se, p_hat + z*se), both endpoints as floats.

Show the precision payoff of more data. For each sample size in sizes = [25, 100, 400], a sample is drawn for you from the same population. Compute the width (high minus low) of the 95% t-interval for the mean at each size, and store them in order in a list named widths (three floats).

Width = high - low from stats.t.interval(0.95, df=n-1, loc=mean, scale=se), with se = sd(ddof=1)/sqrt(n).

Because width scales like 1/sqrt(n), going 25 -> 100 (x4) should roughly halve it, and 100 -> 400 (x4) should roughly halve it again.

CIs and hypothesis tests are two views of one thing

A confidence interval and a hypothesis test are deeply linked: they're the same information in two outfits. A 95% CI for a difference contains exactly the values that a matching two-sided test would not reject at the 5% level. So a quick test: does the interval include the no-effect value?

A 95% CI for a difference that excludes 0 corresponds to a statistically significant result (p < 0.05) — you'd reject "no difference."
A 95% CI that includes 0 means you cannot rule out zero at the 5% level — but, crucially, that's "not proven different," not "proven the same." The interval also includes nonzero values you can't rule out either.

This duality is why many statisticians prefer reporting a CI over a bare p-value: the interval tells you the direction, the plausible magnitude, and the significance all at once. We develop the test side fully in Hypothesis Testing and p-values.
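The duality can be checked on a simple case. The sketch below treats a set of simulated per-user differences (hypothetical data, an assumption) as the quantity of interest, builds the 95% t-interval for their mean, and runs the matching two-sided one-sample t-test against 0. A two-group comparison follows the same logic with a different standard error.

```python
import numpy as np
from scipy import stats

# Hypothetical per-user differences, treatment minus control (an assumption).
rng = np.random.default_rng(4)
diffs = rng.normal(loc=1.0, scale=5.0, size=40)

n = diffs.size
mean_diff = diffs.mean()
se = diffs.std(ddof=1) / np.sqrt(n)

# View 1: the 95% confidence interval for the mean difference.
low, high = stats.t.interval(0.95, df=n - 1, loc=mean_diff, scale=se)
excludes_zero = not (low <= 0.0 <= high)

# View 2: the two-sided t-test of "mean difference = 0".
p_value = stats.ttest_1samp(diffs, popmean=0.0).pvalue
significant = p_value < 0.05

print(f"95% CI: ({low:.2f}, {high:.2f}), excludes 0: {excludes_zero}")
print(f"p-value: {p_value:.4f}, significant at 5%: {significant}")
```

Whatever the data, the two booleans agree: the interval excludes 0 exactly when the test rejects at the 5% level.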

'The CI includes 0' is not 'no effect'

A difference CI of (-0.5, 8.0) includes 0, so you can't declare a significant effect, but it also includes values near +8, a potentially large effect you can't rule out either. The honest reading is "the data are consistent with anything from a small decrease to a sizable increase; we need more data." Absence of evidence (a wide interval straddling 0) is not evidence of absence (a tight interval hugging 0).

Check your understanding

Question (select one)

A pollster reports "52% support, 95% CI 49% to 55%." A reader says "so there's a 95% chance the true support is between 49% and 55%." Why is this not the technically correct frequentist interpretation?

Because the sample was too small to make any probability claim

Because the true support is a fixed number that is either inside or outside this particular interval; the 95% describes the long-run hit rate of the procedure, not the chance for this one interval

Because confidence intervals only apply to means, never to proportions

Because the interval should have been centered on 50%, not 52%

Question (select one)

You build a 95% CI for the mean weekly spend from 50,000 customers and get ($61.40, $61.80), a very narrow interval. A colleague concludes "so 95% of customers spend between $61.40 and $61.80." What's wrong?

Nothing is wrong; that's exactly what the interval says

The interval is too narrow to be trustworthy

The CI bounds the mean, using the standard error (s / √n); individual spending varies far more and is described by the standard deviation, not this interval

The colleague should have used a 99% interval instead

Question (select one)

Why does a 99% confidence interval come out wider than a 95% one built from the same sample?

Because a 99% interval uses a smaller standard error

Because catching the parameter a larger fraction of the time requires a larger critical value, which multiplies the same standard error into a wider band

Because 99% intervals automatically use a larger sample

Because the true mean moves further away at 99% confidence

Question (select one)

In the coverage simulation, 100 separate 95% CIs were built from a known population and about 5 missed the true μ. What do the ~5 misses represent?

A bug in the interval formula that should be fixed

Samples where the population mean happened to change

Unlucky samples whose intervals failed to capture the fixed μ: exactly the ~5% that the 95% confidence level allows

Proof that 95% confidence intervals are unreliable

Question (select one)

Which scenario calls for the t-distribution rather than the normal (z) when building a CI for a mean?

Only when the sample size is above 1,000

When the population standard deviation is unknown and estimated from the sample — which is essentially always in practice

Only when the data are not normally distributed

Never; the normal distribution is always correct for means

Question (select one)

An A/B test gives a 95% CI for the lift in conversion of (-0.4%, +3.1%). Your PM asks: "Did the treatment work?" What's the most accurate answer?

Yes — the interval reaches +3.1%, so the treatment increased conversion

No — the treatment had no effect

We can't conclude a significant effect: the interval includes 0, so the data are consistent with anything from a slight decrease to a sizable increase — we likely need more data

The test is invalid because the lower bound is negative

Question (select one)

Which action genuinely narrows a confidence interval for a mean while keeping its confidence level and validity intact?

Lowering the confidence level from 95% to 90%

Dropping the largest and smallest observations to shrink the spread

Increasing the sample size n

Reporting the bounds with fewer significant figures

Key takeaways

A confidence interval is point estimate ± margin of error, where margin = critical value × standard error. It expresses how precise your estimate is.
"95% confidence" is a property of the procedure: repeat the sampling-and-interval process and ~95% of the intervals cover the true parameter. It is not "a 95% probability the parameter is in this one interval" — the parameter is fixed, the interval was random.
A CI for the mean (uses standard error, shrinks with n) is not a range for the data (uses standard deviation, doesn't shrink).
Width grows with the confidence level and the variability, and shrinks like 1 / √n. Wider is less precise, not "safer."
Use stats.t.interval for means (t-distribution, df = n − 1) and a proportion formula (or stats.binomtest(...).proportion_ci()) for rates.
A difference CI that includes 0 means inconclusive, not no effect — this is the bridge to Hypothesis Testing.

When a parameter has no clean standard-error formula — a median, a trimmed mean, a ratio, a correlation — you can still get a confidence interval by resampling your data. That remarkably general tool is next, in The Bootstrap.
