Effect Sizes
Why a p-value tells you whether an effect exists but never how big it is — and how Cohen's d, correlation r, and risk ratios measure the size that actually drives decisions, always paired with a confidence interval.
You ran the test, the p-value came back 0.001, and the dashboard lit
up green. The new onboarding flow is "statistically significant." Time
to ship — right?
Not so fast. A p-value answers exactly one question: is there an effect at all? It says nothing about the question your boss actually cares about: how big is the effect? Those are different questions, and conflating them is one of the most expensive mistakes in applied data science. A medication can lower blood pressure by a "highly significant" 0.2 mmHg — real, repeatable, and utterly useless. A redesign can lift revenue by a thumping 8% with a p-value that only squeaks under 0.05. Significance and importance are not the same axis.
Effect size is the missing axis. It's a number that says how much, on a scale you can reason about, independent of how many rows you collected. This page is about measuring magnitude — and about why, with a big enough sample, everything becomes significant whether it matters or not.
The problem: significance is not importance
Here is the uncomfortable truth that motivates this entire page. The p-value depends on two things tangled together: how big the effect is, and how much data you have. Crank up the sample size and the p-value shrinks toward zero even if the underlying effect is microscopic, as long as it is not exactly zero. With ten million users, a difference of one-hundredth of a percentage point in conversion can come out "significant." Past a certain sample size, significance is almost guaranteed for any nonzero effect, so it stops carrying information about whether you should care.
A decision needs both pieces of information. The p-value guards against fooling yourself with noise; the effect size tells you whether the real thing is worth acting on. Reporting one without the other is half an answer.
The misconception that costs the most
A small p-value does not mean a big or important effect. p = 0.0001
means "this is very unlikely to be pure chance," not "this is a large
effect." A trivial effect measured on a huge sample produces a tiny
p-value. Always ask how big, not just is it real.
Same p-value, wildly different effects
Two studies can land on the identical p-value while describing effects that are worlds apart in magnitude. The p-value alone cannot tell them apart — only an effect size can.
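The test statistic behind a p-value is roughly the signal (the difference) divided by the noise (the spread), scaled up by the square root of the sample size. A large signal on a small sample and a tiny signal on a huge sample can produce the same statistic, and therefore the same p-value. That's why mature analyses lead with the effect size and treat the p-value as a footnote.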
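To make this concrete, here is a minimal sketch that works from summary statistics alone. The two studies are hypothetical: their effect sizes and sample sizes are invented so that the t-statistic comes out the same. It assumes SciPy is installed and uses `scipy.stats.ttest_ind_from_stats`.

```python
from scipy import stats

# Hypothetical studies, invented for illustration. Each group has a
# standard deviation of 1, so the difference in means equals Cohen's d.
studies = {
    "A": {"d": 0.30, "n": 200},      # moderate effect, modest sample
    "B": {"d": 0.03, "n": 20_000},   # tiny effect, huge sample
}

results = {}
for name, s in studies.items():
    t_stat, p_value = stats.ttest_ind_from_stats(
        mean1=s["d"], std1=1.0, nobs1=s["n"],
        mean2=0.0, std2=1.0, nobs2=s["n"],
    )
    results[name] = (t_stat, p_value)
    print(f"Study {name}: d = {s['d']:.2f}, n per group = {s['n']:>6}, "
          f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Both studies report essentially the same t-statistic and nearly the same p-value (the small gap comes from their different degrees of freedom), yet one effect is ten times the size of the other.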
Cohen's d: a standardized mean difference
The most common effect size for comparing two groups' means is Cohen's d. The idea is simple and intuitive: take the difference in means, but express it in units of the data's own spread.
d = (x̄₁ − x̄₂) / s_pooled
A d of 1 means the two group means sit one full standard deviation
apart. A d of 0.1 means they barely differ at all. Dividing by
the standard deviation is what makes d unitless and comparable
across contexts: a d of 0.5 means the same "amount of separation"
whether you're measuring dollars, milliseconds, or test scores.
Why divide by the standard deviation?
A raw difference of "5" is meaningless on its own: 5 what, relative to how much things naturally vary? If salaries range over tens of thousands of dollars, a 5-dollar gap is negligible; if the quantity you are measuring typically varies by only about 3 dollars, a 5-dollar gap is enormous. Cohen's d puts the difference on the ruler of the data's own variability, so a single number captures practical separation.
Let's compute it from two groups in NumPy. The pooled standard deviation combines both groups' spread.
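A minimal sketch of that computation follows. The function name `cohens_d` and the two small arrays are illustrative choices, not part of any library; the formula is the pooled-standard-deviation version given above, using sample variances (`ddof=1`).

```python
import numpy as np

def cohens_d(x1, x2):
    """Cohen's d: (mean of x1 minus mean of x2) divided by the pooled SD."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    # Sample variances (ddof=1), weighted by each group's degrees of freedom.
    var1 = x1.var(ddof=1)
    var2 = x2.var(ddof=1)
    s_pooled = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return float((x1.mean() - x2.mean()) / s_pooled)

# Toy data, invented for illustration.
control = np.array([1, 2, 3, 4, 5])
treatment = np.array([3, 4, 5, 6, 7])

d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.3f}")  # means differ by 2, pooled SD is sqrt(2.5)
```

Here the means differ by 2 and the pooled standard deviation is about 1.58, so the groups sit roughly 1.26 standard deviations apart.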
The small / medium / large guideposts (use with care)
Jacob Cohen offered rough rules of thumb so people would have some
anchor for interpreting d. They are conventions, not laws of nature.
| Cohen's d | Rough label | Overlap of the two distributions |
|---|---|---|
| ~0.2 | small | the groups overlap a lot |
| ~0.5 | medium | a noticeable, visible separation |
| ~0.8 | large | the groups are clearly distinct |
Context defines 'big', not a table
These labels are starting points, not verdicts. In a domain where tiny
gains compound across millions of users (ad click-through, search
ranking), a d of 0.05 can be worth a fortune. In a medical trial, a
"small" d that reduces mortality is enormous. Never let a lookup table
override domain knowledge about what magnitude matters here.
Demonstrating significance ≠ importance
Time to prove the central claim with code. We'll take a difference so tiny it's practically nothing — and then collect a giant sample. Watch the p-value collapse to "highly significant" while Cohen's d stays negligible.
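One way to set this up is sketched below. The means, standard deviation, seed, and sample size are assumptions picked for illustration: the true difference is 0.3 units on a scale whose standard deviation is 15, which is a true d of 0.02.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000  # observations per group (assumed for illustration)

# Nearly identical processes: the true standardized difference is 0.3 / 15 = 0.02.
control = rng.normal(loc=100.0, scale=15.0, size=n)
treatment = rng.normal(loc=100.3, scale=15.0, size=n)

t_stat, p_value = stats.ttest_ind(treatment, control)

# With equal group sizes, the pooled variance is the average of the two variances.
s_pooled = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
d = (treatment.mean() - control.mean()) / s_pooled

print(f"p-value   = {p_value:.2e}")
print(f"Cohen's d = {d:.4f}")
```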
The p-value is microscopic — the difference is "real" in the sense that it isn't pure chance. But the effect size is near zero, so the finding is real and irrelevant at the same time. That is the whole lesson of this page in one code block. We first met this tension in P-values and Hypothesis Testing; effect sizes are the cure.
The flip side: tiny samples hide real effects
The same tangle works in reverse. A genuinely large effect can come back non-significant if your sample is small — there simply wasn't enough data to rule out chance. That's a power problem (see Errors and Power). The effect size, computed on the same small sample, will still report a large magnitude. This is exactly why you report the effect size and a confidence interval, not just the p-value.
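The reverse case can be sketched from summary statistics. The numbers here are hypothetical: an observed d of 0.8 (conventionally "large") with only 6 observations per group and a standard deviation of 1 in each.

```python
from scipy import stats

observed_d = 0.8   # hypothetical observed effect, in SD units
n_per_group = 6    # deliberately tiny sample

t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=observed_d, std1=1.0, nobs1=n_per_group,
    mean2=0.0, std2=1.0, nobs2=n_per_group,
)
print(f"d = {observed_d}, n per group = {n_per_group}, "
      f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

The p-value lands well above 0.05 even though the observed effect is large: the sample is simply too small to rule out chance.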
Effect size does NOT depend on sample size (the p-value does)
This is the property that makes effect sizes so valuable, and it's worth stating plainly because people often assume the opposite. As you collect more data:
- The p-value marches toward 0 for any nonzero effect; it depends heavily on n.
- The effect size converges to the true magnitude; adding data makes it more precise, not bigger or smaller.
Let's verify it. We fix a true d of about 0.3 and watch what happens
to the p-value versus the estimated d as n grows.
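A simulation along these lines might look like the sketch below. The seed, the list of sample sizes, and the choice of standard normal groups are assumptions made for illustration; `n` is the number of observations per group.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d = 0.3

rows = []
for n in (50, 500, 5_000, 50_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(true_d, 1.0, size=n)   # shifted by true_d standard deviations
    p = stats.ttest_ind(b, a).pvalue
    # Equal group sizes, so the pooled variance is the average of the two.
    s_pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d_hat = (b.mean() - a.mean()) / s_pooled
    rows.append((n, d_hat, p))
    print(f"n = {n:>6}   d = {d_hat:.3f}   p = {p:.2e}")
```

At the smallest sample size the estimate of d is noisy and can land some distance from 0.3. As n grows it settles toward the true value rather than drifting up or down.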
The p-value falls off a cliff while d stays pinned near 0.30. The
effect size is a property of the world; the p-value is a property of
the world and your sample size.
Correlation r is itself an effect size
You already met the Pearson correlation r in Correlation and
Nonparametric Tests. It does double duty: it's both a descriptive
statistic and a perfectly good effect size for the strength of a
linear relationship. It's already standardized (it lives in [-1, 1]),
so no extra scaling is needed.
| Absolute value of r | Rough strength |
|---|---|
| ~0.1 | weak |
| ~0.3 | moderate |
| ~0.5+ | strong |
A handy companion is r², the proportion of variance explained:
an r of 0.3 means the relationship accounts for only 0.09 — about 9%
— of the variation. That reframing often deflates an impressive-sounding
correlation.
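As a small illustration, r and r² can be computed together with `scipy.stats.pearsonr`. The five data points are toy values invented for this sketch.

```python
import numpy as np
from scipy import stats

# Toy data, invented for illustration.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

r, p_value = stats.pearsonr(x, y)
r_squared = r ** 2  # proportion of variance explained

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}, p = {p_value:.3f}")
```

For this data r is 0.8, so r² is 0.64: the linear relationship accounts for about 64% of the variation in y.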
r as a bridge between worlds
Because r is a standardized effect size, it's the natural way to
report "how strongly are X and Y related?" without dragging in the units
of X or Y. The p-value next to it only answers "is the correlation
distinguishable from zero?" — once again, existence versus magnitude.
Effect sizes for proportions: risk difference and relative risk
When your outcome is a yes/no event — converted or not, churned or not, recovered or not — the natural effect sizes compare two proportions. There are two complementary ways to do it, and they tell genuinely different stories.
- Risk difference (absolute): p_treat − p_control. "The conversion rate went up by 2 percentage points."
- Relative risk / risk ratio (relative): p_treat / p_control. "The conversion rate was 1.4× higher."
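Both measures come from the same two proportions, as this short sketch shows. The helper name `risk_effects` and the counts (70 of 1,000 converting in treatment, 50 of 1,000 in control) are hypothetical, chosen to reproduce the "2 percentage points" and "1.4×" figures quoted above.

```python
def risk_effects(events_treat, n_treat, events_control, n_control):
    """Return the absolute risk difference and the relative risk."""
    p_treat = events_treat / n_treat
    p_control = events_control / n_control
    return {
        "p_treat": p_treat,
        "p_control": p_control,
        "risk_difference": p_treat - p_control,  # absolute, in proportion units
        "relative_risk": p_treat / p_control,    # ratio of the two rates
    }

# Hypothetical counts, invented for illustration.
effects = risk_effects(events_treat=70, n_treat=1000,
                       events_control=50, n_control=1000)

print(f"Risk difference: {effects['risk_difference'] * 100:.1f} percentage points")
print(f"Relative risk:   {effects['relative_risk']:.2f}x")
```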
Relative numbers can mislead
"40% more likely!" sounds huge, but if the baseline is 0.001%, the new rate is 0.0014% and the absolute gain is 0.0004 percentage points, which is practically nothing. Headlines love relative risk because it sounds dramatic; honest analysis reports both the absolute risk difference and the relative risk so the reader can judge magnitude in context.
Always pair an effect size with a confidence interval
An effect size is still a single number estimated from a noisy sample —
so it has its own uncertainty. Reporting d = 0.42 alone repeats the
exact sin of reporting a bare point estimate that Confidence Intervals
warned against. The grown-up move is to report the effect size with a
confidence interval around it: d = 0.42, 95% CI [0.20, 0.64].
That interval does triple duty:
- It shows the precision of the magnitude (narrow = pinned down).
- If it excludes 0, you have significance and a sense of size, in one object.
- If it straddles 0 or includes both trivial and large values, it honestly signals "we don't yet know how big this is."
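One way to get such an interval is a large-sample normal approximation to the standard error of d, sketched below. This is one common approximation rather than the only method; a bootstrap or a noncentral-t interval are alternatives. The function name `cohens_d_ci` and the sample size of 160 per group are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

def cohens_d_ci(d, n1, n2, confidence=0.95):
    """Approximate CI for Cohen's d (large-sample normal approximation)."""
    # Approximate standard error of d for two independent groups.
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    z = stats.norm.ppf(0.5 + confidence / 2)  # about 1.96 for 95%
    return float(d - z * se), float(d + z * se)

# Assumed sample size: 160 per group, picked for illustration.
low, high = cohens_d_ci(d=0.42, n1=160, n2=160)
print(f"d = 0.42, 95% CI [{low:.2f}, {high:.2f}]")
```

With 160 observations per group, this approximately reproduces the interval quoted above. Smaller groups would widen it, and larger groups would narrow it around the same point estimate.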
The reporting standard to adopt
Never report a p-value by itself. Report the effect size (how big), its confidence interval (how precise), and the p-value (is it distinguishable from no effect) together. That trio answers every question a decision-maker actually has. You'll use exactly this trio when we run a full experiment in A/B Testing.
Practice
Two NumPy arrays, before and after, have been created for you (independent groups, not paired).
Compute Cohen's d for the difference after - before and store it as a float in a variable named d.
Use the pooled standard deviation with sample variances (ddof=1):
s_pooled = sqrt( ((na-1)*var_before + (nb-1)*var_after) / (na+nb-2) )
d = (mean_after - mean_before) / s_pooled
d must be a Python float and, for this data, positive.
Two arrays, a and b, have been created from almost-identical processes but with a very large sample size.
Demonstrate the gap between significance and importance by computing both:
- p: the p-value from scipy.stats.ttest_ind(a, b), stored as a float.
- d: Cohen's d for b - a (pooled sd, ddof=1), stored as a float.
For this data, p should be below 0.05 (significant) while abs(d) should be below 0.1 (negligible). Your job is to show both can be true at once.
Check your understanding
A study with 2 million users finds that a new font color increases time-on-page by 0.3 seconds, with a p-value of 0.000001. What is the most reasonable conclusion?
The effect is large and important because the p-value is extremely small
The effect is almost certainly real but probably too small to matter; you should check the effect size before acting
The result is invalid because the p-value is suspiciously small
There is no effect because 0.3 seconds is small
You compute Cohen's d on a sample of 30 and again on a sample of 30,000 drawn from the same population. How do you expect the two estimates of d to compare?
The d from the larger sample will be substantially bigger
The d from the larger sample will be substantially smaller
Both should be near the same true value, but the larger sample's estimate will be more precise
They are unrelated because d depends entirely on sample size
A press release says a supplement makes an adverse event "50% more likely." Which question most directly checks whether this matters?
What was the p-value of the comparison?
What is the absolute risk difference — 50% more likely than what baseline rate?
How large was the sample?
What is the correlation coefficient?
A correlation of r = 0.2 between ad spend and sales is reported as a "meaningful relationship." Roughly how much of the variation in sales does it explain?
About 20%
About 4%
About 40%
About 80%
Why is it good practice to report an effect size with a confidence interval rather than the effect size alone?
Because the confidence interval replaces the need for an effect size
Because a confidence interval makes the p-value unnecessary to compute
Because the effect size is estimated from a noisy sample, and the interval shows how precisely the magnitude is pinned down
Because confidence intervals are always narrower than effect sizes
Key takeaways
- A p-value answers "is there an effect?"; an effect size answers "how big?" They are different axes.
- With a large enough sample, trivial effects become significant — so significance alone never justifies acting.
- Cohen's d standardizes a mean difference; r (and r²) measures relationship strength; risk difference and relative risk compare proportions. Report relative and absolute for rates.
- Unlike the p-value, an effect size does not grow with sample size — more data just makes it more precise.
- Always pair an effect size with a confidence interval. We put this whole toolkit to work in A/B Testing.