Errors and Power
The two ways a hypothesis test can be wrong, why power is the chance of catching a real effect, and why underpowered studies quietly poison the research literature.
A hypothesis test makes a yes/no decision under uncertainty, so it can be wrong in exactly two ways — and they are not symmetric. You can sound a false alarm (declare an effect that isn't there) or you can miss a real effect (fail to flag something that is). Understanding these two errors, and the power to avoid the second one, separates analysts who run trustworthy experiments from those who get lucky and don't know it.
This is also where the deepest misconceptions live. "We found no significant difference, so the treatment doesn't work." "The p-value was below 0.05, so who cares about power?" Both are dangerously wrong, and by the end of this page you'll be able to simulate exactly why.
The 2x2 decision matrix
Every test has an unknowable truth (H₀ is really true, or it's really false) and a decision you make (reject H₀, or fail to reject). Cross them and you get four outcomes — two right, two wrong.
Read it as a table:
| | You fail to reject H₀ | You reject H₀ |
|---|---|---|
| H₀ is true | Correct (specificity) | Type I error — false positive, rate α |
| H₀ is false | Type II error — false negative, rate β | Correct detection — power = 1 − β |
Three definitions to memorize, because everything else builds on them:
- Type I error (α): rejecting a true null. A false positive — you "discovered" an effect that isn't real. Its rate is exactly the significance level α you chose.
- Type II error (β): failing to reject a false null. A false negative — there was a real effect and you missed it.
- Power = 1 − β: the probability of correctly rejecting a false null. The chance your test catches a real effect when one exists.
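If it helps to see the matrix as logic rather than a table, here is a minimal sketch. The function name `classify_outcome` and its two boolean arguments are illustrative assumptions, not part of any library.

```python
def classify_outcome(null_is_true: bool, rejected_null: bool) -> str:
    """Name the cell of the 2x2 decision matrix for one test outcome."""
    if null_is_true and rejected_null:
        return "Type I error (false positive)"
    if null_is_true and not rejected_null:
        return "correct (no false alarm)"
    if not null_is_true and rejected_null:
        return "correct detection (power)"
    return "Type II error (false negative)"


# Walk through all four combinations of truth and decision.
for null_is_true in (True, False):
    for rejected_null in (False, True):
        label = classify_outcome(null_is_true, rejected_null)
        print(f"H0 true: {null_is_true!s:<5}  rejected: {rejected_null!s:<5}  ->  {label}")
```

In practice you never know the first argument, which is exactly why you reason about error rates instead of individual outcomes.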
A mnemonic for which is which
Type I comes first, and it's the error of being too eager — crying wolf when there's no wolf (false positive). Type II is the error of being too timid — missing the wolf that's really there (false negative). The "boy who cried wolf" story has both: a Type I error early (false alarm) and a fatal Type II error at the end (real wolf, ignored).
The α/β tradeoff: you can't shrink both for free
Here's the tension. α is the bar for "convincing." Raise the bar (smaller α) and you make fewer false alarms, but you also reject less often overall, so you miss more real effects (β goes up). Lower the bar (larger α) and the reverse happens. With a fixed sample, pushing one error rate down pushes the other up.
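To put rough numbers on the tradeoff, the sketch below holds the sample fixed and tightens α step by step. It uses a normal approximation to a two-sided, two-sample test rather than an exact t-test, and the effect of 0.5, the 30 per group, and the helper name `approx_power` are all assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def approx_power(effect: float, n_per_group: int, alpha: float) -> float:
    """Approximate power of a two-sided, two-sample test (sd = 1 in each group)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)  # rejection threshold on the z scale
    shift = effect * sqrt(n_per_group / 2)      # expected z-statistic when the effect is real
    # Probability that the statistic lands beyond either threshold.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


# Same effect, same n: only alpha changes.
for alpha in (0.10, 0.05, 0.01, 0.001):
    beta = 1 - approx_power(effect=0.5, n_per_group=30, alpha=alpha)
    print(f"alpha = {alpha:<5}  beta = {beta:.2f}")
```

Each stricter α buys fewer false alarms at the price of a larger β, which is the tradeoff in numbers.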
This is why the "right" α depends on which mistake is costlier. A spam filter that flags a real email as spam (false positive) can hide a message you needed; one that lets spam through (false negative) is only mildly annoying, so you tolerate more false negatives to keep false positives rare. A smoke detector is the opposite case: it should scream at the faintest whiff, because a false alarm is cheap and a missed fire is catastrophic, so you accept many Type I errors to crush Type II.
Choosing alpha is a values decision, not a math fact
There is nothing sacred about 0.05. Pick α by asking: in my problem, how bad is a false positive compared to a false negative? When false alarms are expensive (a costly drug rollout), use a stricter α. When missing a real effect is the disaster (early disease screening), loosen α or — better — buy more power with a bigger sample.
The only way to shrink both: more data
The escape hatch from the tradeoff is sample size. With a fixed n
you trade α against β. But collect more data and you can lower
both — a bigger sample sharpens the test's ability to tell signal from
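noise, so the same α buys you more power. Power has four levers: the true effect size, the sample size n, the significance level α, and the variance of the measurements.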
Three of these you often can't control: the true effect size is whatever
nature made it, α is usually pinned by convention, and variance is
limited by how clean your measurement is. The one lever firmly in your
hands is n. That's why "how many samples do I need?" is the central
question of experiment design.
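A quick way to feel the four levers is to move them one at a time from a common baseline. The sketch below uses a normal approximation, not an exact t-test, and the baseline values and the helper name `approx_power` are assumptions picked for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect, n_per_group, alpha, sd):
    """Approximate power of a two-sided, two-sample test with a common sd."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = (effect / sd) * sqrt(n_per_group / 2)  # standardized effect scaled by sample size
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


baseline = {"effect": 0.5, "n_per_group": 30, "alpha": 0.05, "sd": 1.0}

# Each entry overrides exactly one lever and leaves the rest at baseline.
levers = {
    "baseline": {},
    "larger effect (0.8)": {"effect": 0.8},
    "larger n (60)": {"n_per_group": 60},
    "larger alpha (0.10)": {"alpha": 0.10},
    "smaller sd (0.7)": {"sd": 0.7},
}

power_by_lever = {
    name: approx_power(**{**baseline, **change}) for name, change in levers.items()
}
for name, power in power_by_lever.items():
    print(f"{name:<22} power = {power:.2f}")
```

All four changes raise power, but only the sample size is something you can usually just decide.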
Simulating power: just run the test many times and count
Power sounds abstract until you compute it, and the simulation recipe is beautifully simple:
- Build a world where a real effect exists (you set its size).
- Draw a sample and run the test. Did it reject H₀? Yes or no.
- Repeat thousands of times.
- Power = the fraction of repeats that rejected — the rate at which your test catches the effect you planted.
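The recipe above can be sketched in a few lines with NumPy and SciPy. The effect of 0.5 and 30 per group match the numbers discussed next; the seed, the number of repeats, the use of Welch's version of the t-test, and the helper name `simulate_power` are assumptions.

```python
import numpy as np
from scipy import stats


def simulate_power(effect, n_per_group, alpha=0.05, n_experiments=5000, seed=0):
    """Estimate power as the fraction of simulated studies that reject H0."""
    rng = np.random.default_rng(seed)
    # One row per simulated study, one column per observation.
    control = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
    treatment = rng.normal(effect, 1.0, size=(n_experiments, n_per_group))
    # Welch two-sample t-test applied to every row at once.
    p_values = stats.ttest_ind(treatment, control, axis=1, equal_var=False).pvalue
    return float(np.mean(p_values <= alpha))


power = simulate_power(effect=0.5, n_per_group=30)
print(f"estimated power: {power:.2f}")
```

Generating all studies as one 2-D array and testing along `axis=1` avoids a Python loop; a plain `for` loop over studies gives the same estimate, just more slowly.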
With an effect of 0.5 and only 30 per group, power lands somewhere around 0.5 — meaning a coin flip whether you'd detect a real effect of that size. Half the time you'd run this experiment, find nothing significant, and (if you misread it) conclude the effect doesn't exist. It does; your test just wasn't strong enough.
The most dangerous misreading on this page
A non-significant result from an underpowered study does not mean "no effect." If power is 0.4, then even when the effect is unmistakably real, you'll fail to reject H₀ 60% of the time. "We found no significant difference" with low power is almost uninformative — the effect could easily be there, undetected. Always ask "what was my power?" before reading a null result as evidence of absence.
Power rises with sample size and with effect size
Two of the four levers are the ones you reason about most. Let's watch
power climb as we increase n, and separately as we increase the true
effect. We'll draw both curves with Plotly.
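Here is a sketch of the numbers behind the first curve, with the plotting step left out. It assumes a true effect of 0.5, a small grid of sample sizes, and a Welch t-test; `simulate_power` is an illustrative helper, not a library function.

```python
import numpy as np
from scipy import stats


def simulate_power(effect, n_per_group, alpha=0.05, n_experiments=2000, seed=0):
    """Fraction of simulated studies whose Welch t-test rejects H0."""
    rng = np.random.default_rng(seed)
    control = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
    treatment = rng.normal(effect, 1.0, size=(n_experiments, n_per_group))
    p_values = stats.ttest_ind(treatment, control, axis=1, equal_var=False).pvalue
    return float(np.mean(p_values <= alpha))


# Hold the effect at 0.5 and grow the sample size per group.
sample_sizes = [10, 20, 40, 80, 160]
power_by_n = {n: simulate_power(effect=0.5, n_per_group=n) for n in sample_sizes}

for n, power in power_by_n.items():
    flag = "  <- reaches 0.80" if power >= 0.80 else ""
    print(f"n = {n:>3}  power = {power:.2f}{flag}")
```

Passing `list(power_by_n)` and `list(power_by_n.values())` to any plotting library reproduces the curve.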
The curve climbs toward 1.0 as n grows: more data, more power. The green
line marks 0.80, the conventional minimum power people design for —
you want at least an 80% chance of catching a real effect before you bother
running the study. Now hold n fixed and grow the effect instead:
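The same kind of sketch works for the second curve: fix the sample size and sweep the true effect. The choice of 30 per group, the grid of effects, and the helper `simulate_power` are assumptions for illustration, and the chart itself is omitted.

```python
import numpy as np
from scipy import stats


def simulate_power(effect, n_per_group, alpha=0.05, n_experiments=4000, seed=1):
    """Fraction of simulated studies whose Welch t-test rejects H0."""
    rng = np.random.default_rng(seed)
    control = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
    treatment = rng.normal(effect, 1.0, size=(n_experiments, n_per_group))
    p_values = stats.ttest_ind(treatment, control, axis=1, equal_var=False).pvalue
    return float(np.mean(p_values <= alpha))


# Hold n at 30 per group and grow the true effect, starting from no effect at all.
effects = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
power_by_effect = {d: simulate_power(effect=d, n_per_group=30) for d in effects}

for d, power in power_by_effect.items():
    print(f"effect = {d:.1f}  rejection rate = {power:.2f}")
```

The first row uses an effect of exactly zero, so its rejection rate is a false-positive rate rather than power.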
The two punchlines: big effects are easy to detect (high power even at
modest n), tiny effects need either lots of data or they slip through.
And the leftmost point of the second chart is a gift — when the effect is
zero, H₀ is true, so "rejecting" is a Type I error. That point sits
near 0.05, which is exactly α. Let's verify that directly.
The Type I error rate really is α
A well-behaved test, run when H₀ is true, should reject about α of the time — no more, no less. Let's confirm by simulating a world with no effect at all and counting false positives.
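A minimal sketch of that check follows. The seed, the 10,000 repeats, the 30 per group, and the use of Welch's t-test are assumptions; the only essential ingredient is that both groups share one distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_per_group, alpha = 10_000, 30, 0.05

# Both groups come from the same distribution, so H0 is true in every study.
group_a = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
group_b = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))

# Every rejection here is a false positive by construction.
p_values = stats.ttest_ind(group_a, group_b, axis=1, equal_var=False).pvalue
type_i_rate = float(np.mean(p_values <= alpha))
print(f"false-positive rate: {type_i_rate:.3f}")
```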
The false-positive rate lands right around 0.05. That is the literal meaning of α: the rate at which you'll cry wolf when there's no wolf. It is not "the probability H₀ is true" — it's a property of your decision rule, fixed in advance, that holds whenever H₀ happens to be true.
α is NOT the probability that H₀ is true
α = 0.05 means: if H₀ is true, I will wrongly reject it 5% of the time. It is a long-run error rate of your procedure, not a statement about any particular hypothesis. The probability that H₀ is true isn't something a single test can give you at all — same trap as misreading the p-value in P-values.
In the simulation above, both groups were drawn from the same distribution, yet the test rejected about 5% of the time. What does that 5% represent?
The power of the test
The Type I error rate, which equals the chosen α
The probability that H₀ is true
A sign the test is broken
Why underpowered studies are doubly dangerous
You might think an underpowered study is merely useless — it often fails to find real effects. It's worse than that. Underpowered studies don't just miss effects; the "significant" results they do produce are inflated — systematically too big. This is the winner's curse of low power.
The logic: when power is low, a true effect only clears the significance bar on the lucky runs where noise happened to exaggerate it. The unlucky runs (where noise shrank the estimate) stay non-significant and get filed away. So the published, significant estimates are a biased sample — the overshoots. Let's see it.
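One way to see the winner's curse is to simulate many small studies of a true effect of 0.3 and compare the average estimate across all of them with the average among the significant ones only. The 20 per group, the seed, and the Welch t-test are assumptions chosen to make power low.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_effect, n_per_group, alpha, n_experiments = 0.3, 20, 0.05, 20_000

control = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
treatment = rng.normal(true_effect, 1.0, size=(n_experiments, n_per_group))

# Each study's effect estimate is its difference in sample means.
estimates = treatment.mean(axis=1) - control.mean(axis=1)
p_values = stats.ttest_ind(treatment, control, axis=1, equal_var=False).pvalue
significant = p_values <= alpha

mean_all = float(estimates.mean())
mean_significant = float(estimates[significant].mean())

print(f"share of studies reaching significance: {significant.mean():.2f}")
print(f"mean estimate, all studies:             {mean_all:.2f}")
print(f"mean estimate, significant only:        {mean_significant:.2f}")
```

Filtering on `significant` plays the role of a publication filter: nothing about the individual studies is wrong, yet the surviving subset overshoots.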
The average of all estimates sits near the true 0.3: the estimator is unbiased overall. But the average of just the significant ones is much larger. In a world where only significant results get published or believed, low-powered studies systematically overstate effects. This, along with p-hacking (covered in P-values), is a leading driver of the replication crisis: splashy findings that shrink or vanish when someone repeats them with adequate power.
Power matters even when p < 0.05
A tempting misconception: "I got significance, so power is irrelevant now." Wrong. Power governs whether your significant result is trustworthy and replicable. A p < 0.05 from a badly underpowered study is exactly the kind of result that fails to replicate and whose effect size is inflated. Power is not just about avoiding false negatives — it's about whether your positives can be believed.
Challenge 1 — Estimate power by simulation
A real effect exists: the treatment mean is effect higher than control, both with standard deviation 1. Estimate this test's power at n_per_group and alpha by simulation.
- Run n_experiments simulated studies. In each, draw control from normal(0, 1, n_per_group) and treatment from normal(effect, 1, n_per_group), then run a two-sample t-test.
- Count how often the test rejects H₀ (p-value <= alpha).
- Store the rejection rate (rejections / n_experiments) as a float called power.
Use the provided rng for all randomness. Power is just the fraction of experiments that reach significance when the effect is real.
Challenge 2 — Confirm the Type I error rate
Now there is no real effect at all — both groups come from normal(0, 1, n_per_group). Any rejection is a Type I error. Estimate that error rate.
- Run n_experiments simulated studies where both groups are drawn from normal(0, 1, n_per_group).
- Count how often the two-sample t-test rejects H₀ (p-value <= alpha).
- Store the rejection rate as a float called type_I_rate.
Because H₀ is true here, this rate should come out close to alpha (0.05).
Putting it together: the full picture
Notice that power analysis happens before you collect data, not after.
You decide the smallest effect worth detecting, then compute the n that
gives you an 80% (or higher) chance of catching it. Running first and
asking about power later is how underpowered studies happen.
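As a planning sketch, the usual normal-approximation formula for comparing two means gives a sample size per group of about 2(z₁₋α⁄₂ + z_power)² / d², where d is the effect divided by the standard deviation. The function name `required_n_per_group` and its defaults are assumptions, and an exact t-based calculation typically asks for slightly more.

```python
from math import ceil
from statistics import NormalDist


def required_n_per_group(effect, sd=1.0, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # threshold implied by alpha
    z_power = std_normal.inv_cdf(power)          # extra margin needed for the target power
    d = effect / sd                              # standardized effect size
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


# Smaller effects need far more data for the same 80% power.
for effect in (0.8, 0.5, 0.2):
    print(f"effect = {effect}  n per group = {required_n_per_group(effect)}")
```

Because n scales with 1/d², halving the smallest effect you care about roughly quadruples the sample you need.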
Check your understanding
A new drug truly works, but a small trial fails to reach significance and concludes "no effect." Which error was made, and what is its rate called?
A Type I error, with rate α
A Type II error, with rate β
A false positive, with rate 1 − β
No error — non-significant means there is no effect
Statistical power is best defined as:
The probability of making a Type I error
The probability that H₀ is false
The probability of correctly rejecting H₀ when H₀ is actually false (that is, 1 − β)
1 minus the p-value
Which change will decrease the power of a two-sample t-test, all else equal?
Increasing the sample size
Studying a larger true effect
Reducing α from 0.05 to 0.01
Reducing the variability (noise) in the measurements
Why is reporting power especially important after a non-significant result?
Because a non-significant result is always wrong
Because with low power, a real effect is frequently missed, so "not significant" may mean "undetected" rather than "absent"
Because power changes the p-value after the fact
Because high power guarantees the null is true
A team brags that they got p < 0.05 from a tiny pilot study, and says power "doesn't matter now that it's significant." What is the flaw?
They are right; once significant, power is irrelevant
Low power makes a significant result less trustworthy and inflates the apparent effect size, so it often fails to replicate
Power only matters for Bayesian analyses
A significant p-value from a small sample is impossible
You want at least an 80% chance of detecting a difference in conversion of 1 percentage point, if it exists. What should you do before launching the A/B test?
Launch immediately and check power afterward if the result is null
Run a power analysis to find the sample size that yields power >= 0.80 for that effect size, then collect that much data
Pick whatever sample size is convenient; power takes care of itself
Lower α to 0.001 to be safe
Key takeaways
What to carry forward
- A test can err two ways: Type I (false positive, reject a true H₀, rate α) and Type II (false negative, miss a real effect, rate β).
- Power = 1 − β is the chance of detecting a real effect. Design for at least 0.80.
- With a fixed sample, α and β trade off; the way to shrink both is a larger n. Power grows with effect size, n, larger α, and smaller variance.
- α is a false-positive rate, not the probability H₀ is true. Power still matters when p < 0.05: it governs trust and replication.
- A non-significant result from a low-power study means undetected, not absent; and low power inflates the effects it does detect (the winner's curse), feeding the replication crisis.
- Plan power before collecting data; interpret results with P-values, Effect Sizes, and the sampling logic from Sampling Distributions.