Intuition for Inference
Confidence intervals and p-values are the lingua franca of applied statistics, and among the most misinterpreted ideas in science. Let's build correct intuition for what they really mean.
You've now seen that any statistic computed from a sample wiggles. Inference is the art of saying useful things about a population despite that wiggling.
We'll cover three workhorse ideas:
- Confidence intervals — a range of plausible values for an unknown parameter.
- p-values — how surprising your data would be under a skeptical "no effect" assumption.
- Effect size — how big a difference is, separate from how confident we are that it exists.
We'll do all of them in R using built-in functions like t.test(). The math is secondary; the meaning is the goal.
A confidence interval, plainly
A 95% confidence interval for a parameter is a range constructed from your data such that, across many hypothetical repetitions of the study, 95% of intervals constructed this way would contain the true parameter.
Read that carefully. It is not "there's a 95% chance the true value is in this specific interval." The true value is fixed; the interval is what wiggles from sample to sample.
The intuition is easier to see in a simulation:
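A minimal sketch of such a simulation, assuming a normal population with a true mean of 50 and a standard deviation of 10, samples of 30 observations, and 1,000 repeated studies. These numbers and the variable names are illustrative choices, not values given in the text.

```r
# Simulate many studies and count how often the 95% CI covers the truth.
# true_mean, true_sd, n, and n_studies are illustrative assumptions.
true_mean <- 50
true_sd   <- 10
n         <- 30     # observations per simulated study
n_studies <- 1000   # number of hypothetical repetitions

covers <- replicate(n_studies, {
  x  <- rnorm(n, mean = true_mean, sd = true_sd)
  ci <- t.test(x)$conf.int          # 95% CI for the mean by default
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covers)  # fraction of intervals that contain the true mean
```

No seed is set on purpose, so each run draws fresh samples and gives a slightly different fraction.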
Run it a few times. You'll see the fraction of CIs containing the truth hovers around 0.95. That's what "95% confidence" means. Each individual CI either covers the truth or doesn't — we just don't know which.
Visualize it:
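One way to draw this in base R, as a sketch that assumes a normal population with a true mean of 50 and a standard deviation of 10, samples of 30, and 100 simulated intervals. Coloring the intervals that miss the truth in orange is an addition for readability.

```r
# Draw 100 simulated 95% CIs as horizontal lines, with the true mean in red.
# true_mean, true_sd, n, and n_cis are illustrative assumptions.
true_mean <- 50
true_sd   <- 10
n         <- 30
n_cis     <- 100

# Each column of `cis` holds the lower and upper bound of one interval.
cis <- replicate(n_cis, {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  as.numeric(t.test(x)$conf.int)
})

# TRUE where the interval contains the true mean.
hit <- cis[1, ] <= true_mean & true_mean <= cis[2, ]

# Empty plot frame, then one horizontal segment per interval.
plot(0, 0, type = "n", xlim = range(cis), ylim = c(1, n_cis),
     xlab = "Value", ylab = "Simulated study",
     main = "100 simulated 95% confidence intervals")
segments(x0 = cis[1, ], y0 = seq_len(n_cis),
         x1 = cis[2, ], y1 = seq_len(n_cis),
         col = ifelse(hit, "grey40", "orange"))
abline(v = true_mean, col = "red", lwd = 2)  # the fixed, true value
```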
Most lines cross the red truth-line. A few miss. That's sampling variability made visible.
A p-value, plainly
A p-value is the probability, if there were no real effect, of seeing a result as extreme as the one you got — or more extreme.
Small p-value → your observed result is surprising under "no effect" → you have evidence of some effect.
Large p-value → your observed result is unsurprising under "no effect" → you have no strong evidence of an effect (but also no strong evidence against one — absence of evidence is not evidence of absence).
Let's run a quick test:
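A small sketch of a two-sample test on simulated data. The group means (100 and 108), the standard deviation (15), and the group size (40) are assumptions chosen only to have something to test.

```r
# Two simulated groups; means, sd, and sizes are illustrative assumptions.
set.seed(42)
group_a <- rnorm(40, mean = 100, sd = 15)
group_b <- rnorm(40, mean = 108, sd = 15)

# Welch two-sample t-test (R's default; does not assume equal variances).
result <- t.test(group_a, group_b)
result
```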
The output gives you:
- t: a standardized measure of how far apart the groups look relative to the noise.
- df: degrees of freedom (sample-size-related).
- p-value: how surprising this gap (or bigger) would be if the two populations actually had identical means.
- 95 percent confidence interval: a plausible range for the true difference in means.
If p < 0.05, by convention we say the result is "statistically significant." That convention is useful, but also widely abused.
Three things p-values do NOT mean
- NOT "the probability the null hypothesis is true."
- NOT "the probability your result was a fluke."
- NOT "the probability your finding will replicate."
A p-value answers exactly one question: given a specific skeptical assumption (the null), how surprising is what we saw? That's it.
Effect size: significance ≠ importance
Two studies can both have p < 0.001 and tell wildly different stories:
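A sketch with two simulated studies, A and B. The sample sizes and true differences are assumptions picked so that the true effects differ by a factor of 100; cohens_d() is a small hypothetical helper defined here, not a built-in R function.

```r
set.seed(1)

# Study A: enormous sample, tiny true difference (0.02 standard deviations).
a_ctrl  <- rnorm(500000, mean = 0,    sd = 1)
a_treat <- rnorm(500000, mean = 0.02, sd = 1)

# Study B: small sample, large true difference (2 standard deviations).
b_ctrl  <- rnorm(30, mean = 0, sd = 1)
b_treat <- rnorm(30, mean = 2, sd = 1)

# Cohen's d: difference in means divided by the pooled standard deviation.
# This simple pooling assumes the two groups have equal size.
cohens_d <- function(x, y) {
  pooled_sd <- sqrt((var(x) + var(y)) / 2)
  (mean(y) - mean(x)) / pooled_sd
}

p_a <- t.test(a_treat, a_ctrl)$p.value
p_b <- t.test(b_treat, b_ctrl)$p.value
d_a <- cohens_d(a_ctrl, a_treat)
d_b <- cohens_d(b_ctrl, b_treat)

c(p_A = p_a, p_B = p_b)   # both p-values are very small
c(d_A = d_a, d_B = d_b)   # the standardized effect sizes are far apart
```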
A and B might both look "highly significant" by p-value, but the effect sizes differ by ~100×. A is a real but trivial difference; B is a real and large one. Always report — and think about — the effect, not just the p.
A complete inference workflow
Putting the pieces together: load data, summarize, plot, test, interpret.
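A sketch of those five steps, assuming the built-in sleep dataset (extra hours of sleep under two drugs). The text does not name the dataset, so that choice is a guess based on the description, and the test below treats the two groups as independent samples.

```r
# 1. Load: `sleep` ships with R (columns: extra, group, ID).
data(sleep)

# 2. Summarize: mean and sd of `extra` within each group.
aggregate(extra ~ group, data = sleep,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))

# 3. Plot: side-by-side boxplots of the two groups.
boxplot(extra ~ group, data = sleep,
        xlab = "Drug group", ylab = "Extra sleep (hours)")

# 4. Test: Welch two-sample t-test, treating the groups as independent.
tt <- t.test(extra ~ group, data = sleep)
tt

# 5. Interpret: pull out the pieces you will report.
tt$estimate   # mean of each group
tt$conf.int   # 95% CI for (group 1 mean - group 2 mean)
tt$p.value
```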
A grounded reading of that output:
- The boxplot suggests group 2 trends higher on average.
- The estimated difference is reported, with a CI.
- If p < 0.05, we have decent evidence the two drugs differ in their effect, but always re-check the CI to see by how much.
That last step, looking at the CI and not only the p-value, is the single best habit you can develop. The CI shows both the direction and the magnitude of the difference, and how much larger or smaller it could plausibly be.
Common pitfalls
- p-hacking. Trying many tests and reporting only the "significant" ones. With 20 independent tests where no effect exists, you expect about 1 (20 × 0.05) to land below p = 0.05 by chance.
- Overinterpreting non-significance. "p = 0.07" is not "no effect" — it's "we couldn't reliably distinguish it from zero with this sample size."
- Ignoring sample size. With huge n, any tiny gap becomes significant. With small n, even large real effects can be missed.
- Confusing CI with prediction interval. A CI is about the parameter, not about individual future values.
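The p-hacking arithmetic can be checked with a short simulation. This is a sketch in which both groups are always drawn from the same distribution, so every "significant" result is a false positive; n_tests and n_per_group are illustrative assumptions.

```r
# Many t-tests where the null is true: both groups share one distribution.
# n_tests and n_per_group are illustrative assumptions.
n_tests     <- 2000
n_per_group <- 25

p_values <- replicate(n_tests, {
  x <- rnorm(n_per_group)
  y <- rnorm(n_per_group)
  t.test(x, y)$p.value
})

mean(p_values < 0.05)        # share of false positives, close to 0.05
20 * mean(p_values < 0.05)   # expected false positives per batch of 20 tests
1 - 0.95^20                  # chance that at least one of 20 independent tests falls below 0.05
```

The last line evaluates to roughly 0.64, so a batch of 20 no-effect tests yields at least one "significant" result more often than not.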
Test your understanding
"A 95% confidence interval for the mean is [45, 55]." The most accurate reading is:
Hint: the true mean is a fixed (if unknown) number; what changes from study to study is the interval you compute.
- There is a 95% chance the true mean is between 45 and 55.
- 95% of the data falls between 45 and 55.
- If we repeated the study many times and built CIs the same way each time, about 95% of those intervals would contain the true mean.
- The mean is exactly 50.
A study reports p = 0.001 for a difference in means. Which is correct?
Hint: a p-value assumes the null hypothesis is true and asks how surprising the data would be under it, not how likely the null itself is.
- The probability the null hypothesis is true is 0.001.
- The probability the finding is a fluke is 0.001.
- If there were no real difference between the groups, the chance of seeing a difference at least this extreme would be about 0.001.
- The difference is definitely large.
Why is it important to report effect size in addition to a p-value?
- p-values are unreliable.
- Effect size determines the p-value entirely.
- A result can be statistically significant (especially with large n) yet practically meaningless; effect size tells you whether it matters in the real world.
- Confidence intervals are not used in practice.
Mini challenge: full t-test workflow
Using the built-in mtcars dataset, test whether cars with automatic transmission (am == 0) and manual transmission (am == 1) have different mean miles-per-gallon. Save the output of t.test() to a variable tt, and pull out:
- est_diff: the difference of sample means (manual − auto)
- ci: the 95% confidence interval for that difference (length-2 numeric)
- p: the p-value
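Run t.test(mpg ~ am, data = mtcars) and from its returned object set: tt to the test result, est_diff to mean(manual) - mean(automatic) (note: am = 1 is manual, am = 0 is automatic), ci to the confidence interval as a length-2 numeric, and p to the p-value. Be aware that with this formula t.test() reports its interval for group 0 minus group 1 (automatic − manual), so the sign of tt$conf.int is the opposite of est_diff.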
You now understand the conceptual core of statistical inference: sampling, uncertainty, intervals, and tests. The next two pages shift back into the craft of writing analysis code that's clean, reusable, and reproducible.