Intuition for Inference

Confidence intervals and p-values are the lingua franca of applied statistics — and the most misinterpreted ideas in all of science. Let's build correct intuition for what they really mean.

You've now seen that any statistic computed from a sample wiggles. Inference is the art of saying useful things about a population despite that wiggling.

We'll cover three workhorse ideas:

  1. Confidence intervals: a range of plausible values for an unknown parameter.
  2. p-values: how surprising your data would be under a skeptical "no effect" assumption.
  3. Effect size: how big a difference is, separate from how confident we are that it exists.

We'll do all of them in R using built-in functions like t.test(). The math is secondary; the meaning is the goal.

A confidence interval, plainly

A 95% confidence interval for a parameter is a range constructed from your data such that, across many hypothetical repetitions of the study, 95% of intervals constructed this way would contain the true parameter.

Read that carefully. It is not "there's a 95% chance the true value is in this specific interval." The true value is fixed; the interval is what wiggles from sample to sample.
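
To make the construction concrete, here is a minimal sketch of how a 95% interval for a mean can be assembled by hand and checked against t.test(). The sample x is made up for illustration (25 draws from a normal population with an assumed mean of 50 and standard deviation of 10); only the recipe matters.

```r
# Hypothetical sample: illustrative values only, not from any real study
set.seed(1)
x <- rnorm(25, mean = 50, sd = 10)

n      <- length(x)
se     <- sd(x) / sqrt(n)            # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # critical value for a two-sided 95% interval

ci_manual <- mean(x) + c(-1, 1) * t_crit * se   # estimate plus/minus margin
ci_ttest  <- as.numeric(t.test(x)$conf.int)     # what t.test() reports

ci_manual
ci_ttest    # the same two numbers
```

Both endpoints depend on mean(x) and sd(x), which is why the interval moves from sample to sample while the population mean stays put.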

The intuition is easier to see in a simulation:
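
One way such a simulation might look is sketched below. The population (normal, true mean 50, standard deviation 10), the sample size of 30, and the 1,000 repetitions are all assumptions chosen for illustration.

```r
set.seed(42)            # fixed for reproducibility; drop this line to see the
                        # fraction change slightly from run to run
true_mean   <- 50       # the "truth", known only because we are simulating
n_studies   <- 1000     # number of hypothetical repetitions of the study
n_per_study <- 30

covers <- replicate(n_studies, {
  x  <- rnorm(n_per_study, mean = true_mean, sd = 10)
  ci <- t.test(x)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]   # TRUE if this CI contains the truth
})

mean(covers)   # fraction of intervals that contain the true mean
```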

Run it a few times. You'll see the fraction of CIs containing the truth hovers around 0.95. That's what "95% confidence" means. Each individual CI either covers the truth or doesn't — we just don't know which.

Visualize it:
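
A possible plot, sketched with base R graphics under the same assumed population as before (true mean 50, standard deviation 10, 30 observations per study). Each horizontal segment is one simulated study's interval and the red vertical line marks the true mean.

```r
set.seed(42)
true_mean <- 50
n_show    <- 50     # number of simulated intervals to draw

# One row per simulated study: lower and upper CI bounds
cis <- t(replicate(n_show, {
  x <- rnorm(30, mean = true_mean, sd = 10)
  as.numeric(t.test(x)$conf.int)
}))
missed <- cis[, 1] > true_mean | cis[, 2] < true_mean

# Empty canvas, then one segment per interval
plot(0, 0, type = "n", xlim = range(cis), ylim = c(1, n_show),
     xlab = "95% CI for the mean", ylab = "Simulated study")
segments(cis[, 1], seq_len(n_show), cis[, 2], seq_len(n_show),
         col = ifelse(missed, "black", "steelblue"))
abline(v = true_mean, col = "red", lwd = 2)   # the truth-line
```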

Most lines cross the red truth-line. A few miss. That's sampling variability made visible.

A p-value, plainly

A p-value is the probability, if there were no real effect, of seeing a result as extreme as the one you got — or more extreme.

Small p-value → your observed result is surprising under "no effect" → you have evidence of some effect.

Large p-value → your observed result is unsurprising under "no effect" → you have no strong evidence of an effect (but also no strong evidence against one — absence of evidence is not evidence of absence).
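
The definition can be acted out directly. The sketch below uses a permutation approach, which is a swapped-in illustration rather than what t.test() does internally: shuffling the group labels forces "no effect" to be true, and the p-value is the share of shuffled gaps at least as extreme as the observed one. The two groups are made-up numbers.

```r
set.seed(7)
# Hypothetical measurements for two groups (illustrative values only)
a <- rnorm(20, mean = 10, sd = 2)
b <- rnorm(20, mean = 12, sd = 2)
observed <- mean(b) - mean(a)

pooled <- c(a, b)
null_diffs <- replicate(5000, {
  shuffled <- sample(pooled)       # random relabeling: group membership is meaningless
  mean(shuffled[21:40]) - mean(shuffled[1:20])
})

# Two-sided: how often is a "no effect" gap at least as large as the one we saw?
p_perm <- mean(abs(null_diffs) >= abs(observed))
p_perm
```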

Let's run a quick test:
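
A minimal sketch of such a test, assuming two independent groups of made-up scores (the names control and treatment and all numbers are hypothetical):

```r
set.seed(123)
# Hypothetical outcome scores for two independent groups
control   <- rnorm(40, mean = 100, sd = 15)
treatment <- rnorm(40, mean = 108, sd = 15)

result <- t.test(treatment, control)   # Welch two-sample t-test by default
result                                 # printed summary

# The same pieces, pulled out of the returned object
result$statistic   # t
result$parameter   # df
result$p.value     # p-value
result$conf.int    # 95 percent confidence interval for the difference in means
```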

The output gives you:

  • t: a standardized measure of how far apart the groups look relative to the noise.
  • df: degrees of freedom (sample-size-related).
  • p-value: how surprising this gap (or bigger) would be if the two populations actually had identical means.
  • 95 percent confidence interval: a plausible range for the true difference in means.

If p < 0.05, by convention we say the result is "statistically significant." That convention is useful — but also widely abused.

Three things p-values do NOT mean

  • NOT "the probability the null hypothesis is true."
  • NOT "the probability your result was a fluke."
  • NOT "the probability your finding will replicate."

A p-value answers exactly one question: given a specific skeptical assumption (the null), how surprising is what we saw? That's it.

Effect size: significance ≠ importance

Two studies can both have p < 0.001 and tell wildly different stories:
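
A sketch of two such studies, with simulated data and assumed parameters: study A has an enormous sample and a true gap of 0.02 standard deviations, study B has 50 observations per group and a true gap of 2 standard deviations. The helper cohens_d() is defined here for illustration and is not a built-in function.

```r
set.seed(2024)

# Study A: enormous sample, tiny true gap
a1 <- rnorm(500000, mean = 0,    sd = 1)
a2 <- rnorm(500000, mean = 0.02, sd = 1)

# Study B: modest sample, large true gap
b1 <- rnorm(50, mean = 0, sd = 1)
b2 <- rnorm(50, mean = 2, sd = 1)

# Cohen's d: difference in means divided by the pooled standard deviation
cohens_d <- function(x, y) {
  pooled_sd <- sqrt(((length(x) - 1) * var(x) + (length(y) - 1) * var(y)) /
                      (length(x) + length(y) - 2))
  (mean(y) - mean(x)) / pooled_sd
}

p_a <- t.test(a2, a1)$p.value
p_b <- t.test(b2, b1)$p.value
d_a <- cohens_d(a1, a2)
d_b <- cohens_d(b1, b2)

c(p_A = p_a, p_B = p_b)   # both below 0.001
c(d_A = d_a, d_B = d_b)   # a tiny effect next to a large one
```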

A and B might both look "highly significant" by p-value, but the effect sizes differ by ~100×. A is a real but trivial difference; B is a real and large one. Always report — and think about — the effect, not just the p.

A complete inference workflow

Putting the pieces together: load data, summarize, plot, test, interpret.
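
A sketch of those five steps, assuming the built-in sleep dataset (extra hours of sleep under two drugs, coded as group 1 and group 2). The choice of dataset is an assumption made for this example.

```r
# Load
data(sleep)

# Summarize: mean and standard deviation of `extra` per group
aggregate(extra ~ group, data = sleep,
          FUN = function(v) c(mean = mean(v), sd = sd(v)))

# Plot
boxplot(extra ~ group, data = sleep,
        xlab = "Drug (group)", ylab = "Extra sleep (hours)")

# Test: Welch two-sample t-test, treating the groups as independent
tt <- t.test(extra ~ group, data = sleep)
tt

# Interpret
diff_est <- unname(tt$estimate[2] - tt$estimate[1])   # group 2 minus group 1
diff_est
tt$conf.int   # reported for group 1 minus group 2, so the sign is flipped
tt$p.value
```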

A grounded reading of that output:

  1. The boxplot suggests group 2 trends higher on average.
  2. The estimated difference is reported, with a CI.
  3. If p < 0.05, we have decent evidence the two drugs differ in their effect — but always re-check the CI to see by how much.

That last step (looking at the CI, not only the p-value) is the single best habit you can develop. The CI tells you the direction and magnitude of the difference, and how much larger or smaller it could plausibly be.

Common pitfalls

  • p-hacking. Trying many tests and reporting only the "significant" ones. With 20 independent tests where no effect exists, you expect about 1 (20 × 0.05) to land below p = 0.05 by chance.
  • Overinterpreting non-significance. "p = 0.07" is not "no effect"; it means "we couldn't reliably distinguish it from zero with this sample size."
  • Ignoring sample size. With a huge n, even a tiny nonzero gap can become significant. With a small n, even large real effects can be missed.
  • Confusing CI with prediction interval. A CI is about the parameter, not about individual future values.
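
The p-hacking arithmetic can be checked with a small simulation. This is a sketch under assumed settings: every test compares two groups of 30 drawn from the same standard normal population, so any "significant" result is a false positive.

```r
set.seed(99)
n_batches       <- 1000   # how many times we repeat the "run 20 tests" exercise
tests_per_batch <- 20

false_hits <- replicate(n_batches, {
  p_vals <- replicate(tests_per_batch, {
    # Both groups come from the same population, so the null is true
    t.test(rnorm(30), rnorm(30))$p.value
  })
  sum(p_vals < 0.05)      # count of "significant" results in this batch
})

mean(false_hits)          # average false positives per batch of 20 tests
mean(false_hits >= 1)     # share of batches with at least one false positive
```

The second number is the more worrying one: with 20 independent null tests, the chance of at least one p below 0.05 is 1 − 0.95^20, or about 64%.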

Test your understanding

Question (select one)

"A 95% confidence interval for the mean is [45, 55]." The most accurate reading is:

Hint: the true mean is a fixed (if unknown) number; what changes from study to study is the interval you compute.

  • There is a 95% chance the true mean is between 45 and 55.
  • 95% of the data falls between 45 and 55.
  • If we repeated the study many times and built CIs the same way each time, about 95% of those intervals would contain the true mean.
  • The mean is exactly 50.

Question (select one)

A study reports p = 0.001 for a difference in means. Which is correct?

Hint: a p-value assumes the null hypothesis is true and asks how surprising the data would be under it — not how likely the null itself is.

  • The probability the null hypothesis is true is 0.001.
  • The probability the finding is a fluke is 0.001.
  • If there were no real difference between the groups, the chance of seeing a difference at least this extreme would be about 0.001.
  • The difference is definitely large.

Question (select one)

Why is it important to report effect size in addition to a p-value?

  • p-values are unreliable.
  • Effect size determines the p-value entirely.
  • A result can be statistically significant (especially with large n) yet practically meaningless; effect size tells you whether it matters in the real world.
  • Confidence intervals are not used in practice.

Mini challenge: full t-test workflow

Using the built-in mtcars dataset, test whether cars with automatic transmission (am == 0) and manual transmission (am == 1) have different mean miles-per-gallon. Save the output of t.test() to a variable tt, and pull out:

  • est_diff — the difference of sample means (manual − auto)
  • ci — the 95% confidence interval for that difference (length-2 numeric)
  • p — the p-value

Run t.test(mpg ~ am, data = mtcars) and from its returned object set: tt to the test result, est_diff to mean(manual) - mean(automatic) (note: am = 1 is manual, am = 0 is automatic), ci to the confidence interval as a length-2 numeric, and p to the p-value.
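
If you want to check your work, one possible solution is sketched below. It relies on t.test() ordering the groups by the levels of am, so the first estimate is automatic (am = 0) and the second is manual (am = 1).

```r
tt <- t.test(mpg ~ am, data = mtcars)

# tt$estimate holds the group means in order: am = 0 (automatic), am = 1 (manual)
est_diff <- unname(tt$estimate[2] - tt$estimate[1])   # manual minus automatic

# Confidence interval as a plain length-2 numeric vector
ci <- as.numeric(tt$conf.int)

p <- tt$p.value

est_diff
ci
p
```

Note that t.test() reports its interval for automatic minus manual, so both bounds in ci are negative; negate and reverse them (rev(-ci)) to read the interval in the manual minus automatic direction.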

You now understand the conceptual core of statistical inference: sampling, uncertainty, intervals, and tests. The next two pages shift back into the craft of writing analysis code that's clean, reusable, and reproducible.
