t-Tests
How to compare means under uncertainty with one-sample, two-sample (Welch), and paired t-tests — the t-statistic as signal divided by noise, the assumptions that matter, and how to interpret t, p, an interval, and an effect size.
You already know the big idea from Hypothesis Testing: a difference between two averages might be real, or it might be the kind of wiggle random sampling hands you for free. The t-test is the workhorse for deciding which — specifically, for questions about means.
Three flavors cover almost everything you'll meet as a data scientist:
- One-sample — is this group's average different from a fixed target? ("Do our servers really respond in 200 ms on average?")
- Two-sample (independent) — do two separate groups have different averages? ("Did the new onboarding flow change average revenue per user?")
- Paired — did the same units change between two conditions? ("Did each patient's blood pressure drop after the drug?")
This page is about picking the right one, running it with scipy.stats,
and — the part that actually matters — reading the result like a careful
analyst instead of a p-value vending machine.
The t-statistic is signal divided by noise
Every t-test boils your data down to a single number, the t-statistic. Strip away the bookkeeping and it's just a ratio:
t = (difference you observed) / (noise in that difference)
The numerator is the signal: how far apart the means are. The denominator is the standard error, the typical size of the random wobble you'd expect even if nothing were going on (you met the standard error in Standard Error). So:
- Big |t| → the gap dwarfs the noise → surprising if there were no real difference.
- Small |t| → the gap is the size of ordinary noise → nothing to see.
The t-distribution then translates that ratio into a p-value: if there were truly no difference, how often would noise alone produce a |t| at least this big? Small p-value, surprising-under-the-null, you reject. That's the whole machine — the three tests below differ only in what difference goes on top and what noise goes on the bottom.
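The ratio can be checked by hand. The sketch below assumes two simulated groups (the seed, group sizes, and distribution parameters are illustrative, not real measurements) and uses the unequal-variance form of the standard error; the names group_a, group_b, signal, and noise are chosen here for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: two simulated groups. Seed, sizes, and distribution
# parameters are illustrative assumptions, not real measurements.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50.0, scale=10.0, size=80)
group_b = rng.normal(loc=54.0, scale=12.0, size=60)

# Signal: the observed difference in sample means.
signal = group_a.mean() - group_b.mean()

# Noise: the standard error of that difference (unequal-variance form).
noise = np.sqrt(group_a.var(ddof=1) / len(group_a)
                + group_b.var(ddof=1) / len(group_b))

t_by_hand = signal / noise

# scipy computes the same ratio and also supplies the p-value.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"by hand: t = {t_by_hand:.3f}")
print(f"scipy:   t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```

The hand-computed ratio and scipy's statistic should agree to floating-point precision; the library's extra work is turning that ratio into a p-value.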
Why 't' and not just a z-score?
When you don't know the true population spread (you almost never do) you estimate it from the sample. That estimate is itself noisy, especially for small samples, so the t-distribution has slightly fatter tails than the normal to stay honest about that extra uncertainty. With large samples the t-distribution is practically indistinguishable from the normal; the fat tails only matter when n is small.
Which t-test? A decision flowchart
Before any code, answer one question: what is the structure of your data? Getting this wrong is the single most common t-test mistake.
We'll take them one at a time.
One-sample t-test: a mean versus a target
The question it answers: is the true average of one group different from some fixed value you care about?
A cloud team promises a 200 ms average API response time. You pull a random sample of 50 real requests. The sample averages a bit above 200 — but is that a genuine breach of the promise, or just a slow afternoon?
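A minimal sketch of this scenario with scipy.stats.ttest_1samp. The 50 response times are simulated: the seed, the 206 ms centre, and the 18 ms spread are assumptions for illustration, and response_ms is a hypothetical variable name.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 50 simulated response times in milliseconds.
# The seed, centre (206 ms), and spread (18 ms) are assumptions.
rng = np.random.default_rng(42)
response_ms = rng.normal(loc=206.0, scale=18.0, size=50)

# One-sample, two-sided t-test against the promised 200 ms average.
t_stat, p_value = stats.ttest_1samp(response_ms, popmean=200)

print(f"sample mean = {response_ms.mean():.1f} ms")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```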
How to read it. A positive t_stat means the sample mean sits above
the target; its size says how many standard errors above. The p-value is
the probability of a sample mean at least this far from 200 if the true
average really were 200. Below 0.05 and you doubt the 200 ms claim; above
and you can't distinguish the data from that claim.
But "significant" doesn't tell you how far off the promise is. For that, pair the test with a confidence interval for the mean.
Always report the interval, not just the p-value
A p-value answers "is it different from the target?" A confidence interval answers "different by how much, and how precisely do I know that?" The interval contains strictly more information: it even tells you the test's verdict (at significance level alpha, reject exactly when the target falls outside the matching 1 − alpha interval, for example a 95% interval for alpha = 0.05). We dig into intervals in Confidence Intervals and into "how big is the effect?" in Effect Sizes.
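One way to get that interval is from the t-distribution itself. This is a sketch on simulated response times (seed and distribution parameters are assumptions); it builds a 95% interval for the mean with stats.t.interval and checks that the interval and the test give the same verdict.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 50 simulated response times in milliseconds.
rng = np.random.default_rng(42)
response_ms = rng.normal(loc=206.0, scale=18.0, size=50)
target = 200

n = len(response_ms)
mean = response_ms.mean()
sem = stats.sem(response_ms)  # standard error of the mean (ddof=1)

# 95% confidence interval for the mean, using n - 1 degrees of freedom.
ci_low, ci_high = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

p_value = stats.ttest_1samp(response_ms, popmean=target).pvalue
target_outside = not (ci_low <= target <= ci_high)

print(f"mean = {mean:.1f} ms, 95% CI = [{ci_low:.1f}, {ci_high:.1f}]")
print(f"p = {p_value:.4f}, target outside interval: {target_outside}")
```

The interval is in milliseconds, so it says directly how far from 200 ms the true average plausibly sits.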
Two-sample t-test: comparing two independent groups
The question it answers: do two separate groups have different average values? The groups contain different units — different users, different customers, different machines — with no natural pairing between them.
This is the engine of most A/B tests. Group A saw the old page, group B saw the new one; did average revenue per user differ?
Default to Welch's t-test (equal_var=False)
There are two versions of the two-sample t-test:
- Student's t-test assumes the two groups have the same variance.
- Welch's t-test does not — it allows the groups to have different spreads.
In real data the variances are basically never identical, and Welch's
test barely costs you anything when they happen to be equal while
protecting you when they aren't. So the modern recommendation is simple:
use Welch's by default. In scipy.stats that means passing
equal_var=False.
scipy's default is the OLD default — override it
stats.ttest_ind(a, b) defaults to equal_var=True (Student's), which
quietly assumes equal variances. Unless you have a strong reason to
believe the spreads are equal, pass equal_var=False to get Welch's
test. It's the safer choice and what most statisticians now recommend.
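A sketch comparing the two versions on the same data. The groups are simulated with deliberately unequal sizes and spreads (all numbers are assumptions for illustration); control is passed first.

```python
import numpy as np
from scipy import stats

# Hypothetical data: a large, tight control group and a small, noisy
# treatment group. Seed and parameters are illustrative assumptions.
rng = np.random.default_rng(7)
control = rng.normal(loc=10.0, scale=2.0, size=200)
treatment = rng.normal(loc=10.6, scale=6.0, size=40)

# Student's t-test: scipy's default, assumes equal variances.
student = stats.ttest_ind(control, treatment, equal_var=True)

# Welch's t-test: does not assume equal variances.
welch = stats.ttest_ind(control, treatment, equal_var=False)

print(f"Student: t = {student.statistic:.3f}, p = {student.pvalue:.4f}")
print(f"Welch:   t = {welch.statistic:.3f}, p = {welch.pvalue:.4f}")
```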
Notice the two p-values differ. When the spreads and group sizes are unequal, Student's test can be miscalibrated — its p-value is not quite the error rate it claims. Welch's stays honest. The cost of using Welch when variances really are equal is negligible, which is why it's the sensible default.
How to read the two-sample result. The sign of t tells you which group is higher (here, with control first, a negative t means treatment > control). The p-value asks: if the two groups truly had the same mean, how often would noise produce a gap at least this big? Below alpha, you conclude the means differ. As always, follow up with an interval and an effect size to learn how much.
Effect size, briefly
Cohen's d rescales the difference in means into units of standard deviation, which makes it independent of your measurement units. By a common rule of thumb, |d| ≈ 0.2 is small, 0.5 medium, and 0.8 large. A microscopic difference can be "significant" in a huge sample yet have a tiny d. We treat effect sizes properly in Effect Sizes; for now, just remember a p-value alone never tells you if a difference matters.
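scipy.stats has no built-in Cohen's d, so the sketch below defines a small helper, cohens_d (a hypothetical name), using the pooled standard deviation, which is one common definition. The data are simulated with a tiny true difference and very large groups; all numbers are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Mean difference (b minus a) in units of the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1)
                  + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(b) - np.mean(a)) / np.sqrt(pooled_var)

# Hypothetical data: a 0.3-unit true difference on a scale with SD 15,
# measured on half a million units per group.
rng = np.random.default_rng(11)
control = rng.normal(loc=100.0, scale=15.0, size=500_000)
treatment = rng.normal(loc=100.3, scale=15.0, size=500_000)

p_value = stats.ttest_ind(control, treatment, equal_var=False).pvalue
d = cohens_d(control, treatment)

print(f"p = {p_value:.2e}, Cohen's d = {d:.3f}")
```

With groups this large the p-value is tiny, yet d stays far below the 0.2 "small" threshold: significant, but not necessarily important.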
Paired t-test: the same units, measured twice
The question it answers: when you measure the same unit under two conditions, did it change?
This is the before/after design: the same patient's weight before and after a program, the same server's latency before and after a config change, the same user's spend last month and this month. The two columns line up row by row — each row is one unit.
Why pairing is a superpower
Here's the key insight. People differ enormously from one another. If you treat before/after as two independent groups, all that person-to-person variation lands in the denominator as noise and drowns out the change you care about. A paired test sidesteps it: it looks at each unit's own change (after − before), so every person serves as their own control. The big between-person variation cancels, the noise shrinks, and the test gains power to detect a real effect.
A paired t-test is literally a one-sample t-test on the column of differences, with the target being zero ("no change"). Let's prove that to ourselves.
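A sketch of that equivalence on simulated before/after data. Each unit gets its own baseline plus measurement noise, and the treatment lowers the value by 4 on average; the seed and all parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 30 units with very different baselines, each
# measured before and after. All parameters are assumptions.
rng = np.random.default_rng(1)
n = 30
baseline = rng.normal(loc=140.0, scale=15.0, size=n)   # unit-to-unit spread
before = baseline + rng.normal(0.0, 3.0, size=n)
after = baseline - 4.0 + rng.normal(0.0, 3.0, size=n)

# Paired t-test on the two columns.
paired = stats.ttest_rel(after, before)

# One-sample t-test on the per-unit differences against zero.
one_sample = stats.ttest_1samp(after - before, popmean=0)

print(f"paired:     t = {paired.statistic:.4f}, p = {paired.pvalue:.6f}")
print(f"one-sample: t = {one_sample.statistic:.4f}, p = {one_sample.pvalue:.6f}")
```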
Now watch the power gain. Same numbers, but analyzed the wrong way — as two independent groups — the person-to-person spread swamps the signal.
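A sketch of the comparison on one set of simulated before/after data (seed and parameters are assumptions for illustration): the identical numbers go through a paired test and through an independent Welch test.

```python
import numpy as np
from scipy import stats

# Hypothetical data: large unit-to-unit spread (SD 15), small measurement
# noise (SD 3), and a true average drop of 4. All values are assumptions.
rng = np.random.default_rng(1)
n = 30
baseline = rng.normal(loc=140.0, scale=15.0, size=n)
before = baseline + rng.normal(0.0, 3.0, size=n)
after = baseline - 4.0 + rng.normal(0.0, 3.0, size=n)

# Right way: paired, so each unit is its own control.
paired = stats.ttest_rel(after, before)

# Wrong way for this design: treat the columns as independent groups.
unpaired = stats.ttest_ind(after, before, equal_var=False)

print(f"paired:   t = {paired.statistic:.2f}, p = {paired.pvalue:.5f}")
print(f"unpaired: t = {unpaired.statistic:.2f}, p = {unpaired.pvalue:.5f}")
```

Both tests see the same mean difference on top; only the noise on the bottom changes, and the unpaired version's noise includes all the unit-to-unit spread.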
Misconception: paired and independent tests are interchangeable
They are not. Using an unpaired test on paired data discards the
matching and usually loses power (you may miss a real effect). Using a
paired test on data that isn't actually matched is worse — it's
simply invalid, because you'd be pairing rows that have no relationship.
The rule: if each row is one unit measured under both conditions, pair it
(ttest_rel). If the two groups are different units, don't
(ttest_ind).
The assumptions, and how much they matter
Every t-test rests on three assumptions. They are not equally fragile.
| Assumption | What it means | How fragile? |
|---|---|---|
| Independence | Observations don't influence each other (within a group / across pairs) | Critical. No test fixes broken independence. |
| Approximate normality | The sampling distribution of the mean is roughly normal | Forgiving, thanks to the CLT — large n rescues it. |
| Equal variances | The two groups have the same spread | Only for Student's. Use Welch's and stop worrying. |
The normality assumption is widely misunderstood
The t-test does not require your raw data to be normally distributed. It requires the sampling distribution of the mean to be approximately normal — and the central limit theorem (see Central Limit Theorem) makes that happen automatically as the sample grows, even for skewed data. The cell below shows it: clearly non-normal raw data, yet the one-sample t-test behaves correctly because n is large enough.
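One way to check this is by simulation. The sketch below draws many samples from an exponential distribution (strongly right-skewed, true mean 1.0), tests each against the true mean, and counts how often the test falsely rejects. The seed, n = 200, and the 5000 repetitions are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: 5000 simulated samples of size 200 from a skewed
# (exponential) distribution whose true mean is exactly 1.0.
rng = np.random.default_rng(3)
n_sims, n = 5000, 200
samples = rng.exponential(scale=1.0, size=(n_sims, n))

# The raw data are clearly non-normal: skewness is far from 0.
raw_skew = stats.skew(samples.ravel())

# One t-test per row, each against the TRUE mean, so the null holds
# and every rejection is a false positive.
p_values = stats.ttest_1samp(samples, popmean=1.0, axis=1).pvalue
false_positive_rate = np.mean(p_values < 0.05)

print(f"skewness of raw data = {raw_skew:.2f}")
print(f"false-positive rate at alpha = 0.05: {false_positive_rate:.3f}")
```

If the test is behaving, the false-positive rate should land close to the nominal 0.05 despite the skew in the raw data.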
When normality DOES bite
The CLT needs enough data to work. With a small sample (say n < 15) that is heavily skewed or has extreme outliers, the t-test can be unreliable — the sampling distribution hasn't had a chance to become normal. That's exactly when you reach for a nonparametric alternative like the Mann–Whitney U or Wilcoxon signed-rank test, covered in Correlation and Nonparametric Tests.
When to use a t-test — and when not to
Reach for a t-test when:
- You're comparing means (averages), and
- The data is numeric/continuous, and
- The sample is reasonably sized or roughly symmetric, and
- You can match the design to the right flavor (one / two / paired).
Look elsewhere when:
- You're comparing 3 or more group means → one-way ANOVA (running many t-tests inflates false positives — see ANOVA and Chi-Square).
- Your outcome is categorical (counts, yes/no) → chi-square (ANOVA and Chi-Square), not a t-test.
- The sample is small and badly skewed / outlier-ridden, or the data is ordinal (ranks) → a nonparametric test (Correlation and Nonparametric Tests).
- You really want an effect size or a range, not a yes/no → an interval and a Cohen's d (Confidence Intervals, Effect Sizes).
A fitness study records each participant's resting heart rate before and after an 8-week program — the same people both times. Which t-test fits, and why?
A two-sample independent t-test, because there are two columns of numbers
A paired t-test, because each participant is measured under both conditions, so the before/after values are matched row by row
A one-sample t-test on the "after" column against 0
It doesn't matter; paired and unpaired give the same answer
Challenge 1 — Run the right two-sample test
An e-commerce team A/B tests a new product page. You have session durations (in seconds) for the control and treatment groups. The groups are different users (independent), and you should not assume their variances are equal.
- Run the appropriate two-sample, two-sided t-test that does not assume equal variances (Welch's test).
- Store the p-value as a float named p_value.
- Set a boolean reject to whether the result is significant at alpha = 0.05 (reject when p_value <= alpha).
Pick the right scipy.stats function and the right keyword argument — you should not compute anything by hand.
Challenge 2 — Decide from a paired test
A team measures page load time (in seconds) on the same 30 pages, before and after a caching change. Because it's the same pages both times, the data is paired.
- Run the correct paired, two-sided t-test on before vs after.
- Store the p-value as a float named p_value.
- Store the mean improvement (before - after, so a positive number means it got faster) as a float named mean_improvement.
- Set a string verdict to "faster" if the test is significant at alpha = 0.05 and mean_improvement > 0; otherwise "no clear change".
Use the paired scipy.stats function — do not treat the two columns as independent groups.
Common misconceptions, gathered
Five t-test traps
- Unpaired when the data is paired (or vice versa). Match the test to the design — same units twice means paired.
- Assuming equal variances by default. Real groups rarely have identical spreads; use Welch's (equal_var=False).
- "The data must be normal." It's the sampling distribution of the mean that needs to be roughly normal; the CLT handles that for decent sample sizes. Raw skew is usually fine if n is large enough.
- Significance = importance. A tiny, meaningless difference becomes "significant" with enough data. Report an effect size and an interval.
- A non-significant result proves "no difference." It means you couldn't detect one — the effect may be real but small. (The not-guilty verdict from Hypothesis Testing.)
Check your understanding
The t-statistic in a two-sample t-test is best described as which ratio?
The difference in means divided by the total sample size
The observed difference in means divided by the standard error of that difference (signal divided by noise)
The p-value divided by the significance level
The variance of group A divided by the variance of group B
Why is Welch's t-test (equal_var=False) the recommended default for two independent samples?
It always produces a smaller p-value, making effects easier to detect
It removes the need for the independence assumption
It does not assume the two groups have equal variances, so it stays well-calibrated when spreads differ while costing little when they don't
It works even when the data is categorical
Your raw measurements are clearly right-skewed (a long tail), but you have n = 500 observations. Can you still use a t-test on the mean?
No — the t-test requires the raw data itself to be normally distributed
Yes — with a large sample the central limit theorem makes the sampling distribution of the mean approximately normal, which is what the test actually needs
No — skewed data always requires a nonparametric test regardless of sample size
Only if you first delete the long tail as outliers
A nutrition trial weighs the same 40 people before and after a diet. An analyst runs an independent two-sample t-test on the before and after columns and finds no significant difference. What's the most likely problem?
The sample is too small for any t-test
An independent test ignores the pairing, dumping large person-to-person variation into the noise and likely hiding a real effect
Independent t-tests can never detect weight change
The p-value should have been doubled
A two-sample t-test on 2 million users returns p < 0.0001 for a difference in average session length of 0.4 seconds. What is the right interpretation?
The effect is large because the p-value is so small
The result must be a mistake because nothing changes session length by under a second
The difference is almost certainly real but probably too small to matter; report an effect size and a confidence interval to judge importance
We should rerun with fewer users so the result becomes non-significant
Key takeaways
What to carry forward
- A t-test compares means; its statistic is signal ÷ noise (difference ÷ standard error), and the p-value asks how surprising that ratio is if there's no real difference.
- Match the design: one group vs a target → ttest_1samp; two independent groups → ttest_ind(..., equal_var=False) (Welch's); same units measured twice → ttest_rel (paired).
- Default to Welch's for two independent samples: it drops the fragile equal-variance assumption at almost no cost.
- Pair when the data is paired. Pairing removes between-unit variation and buys power; mismatching the design wastes it or invalidates the test.
- The t-test needs the sampling distribution of the mean to be roughly normal (the CLT helps), not the raw data. Small + skewed → go nonparametric (Correlation and Nonparametric Tests).
- Always pair the verdict with a confidence interval and an effect size — significance is not importance.