Errors and Power

The two ways a hypothesis test can be wrong, why power is the chance of catching a real effect, and why underpowered studies quietly poison the research literature.

A hypothesis test makes a yes/no decision under uncertainty, so it can be wrong in exactly two ways — and they are not symmetric. You can sound a false alarm (declare an effect that isn't there) or you can miss a real effect (fail to flag something that is). Understanding these two errors, and the power to avoid the second one, separates analysts who run trustworthy experiments from those who get lucky and don't know it.

This is also where the deepest misconceptions live. "We found no significant difference, so the treatment doesn't work." "The p-value was below 0.05, so who cares about power?" Both are dangerously wrong, and by the end of this page you'll be able to simulate exactly why.

The 2x2 decision matrix

Every test has an unknowable truth (H₀ is really true, or it's really false) and a decision you make (reject H₀, or fail to reject). Cross them and you get four outcomes — two right, two wrong.

Read it as a table:

	You fail to reject H₀	You reject H₀
H₀ is true	Correct (specificity)	Type I error — false positive, rate α
H₀ is false	Type II error — false negative, rate β	Correct detection — power = 1 − β

Two definitions to memorize, because everything else builds on them:

Type I error (α): rejecting a true null. A false positive — you "discovered" an effect that isn't real. Its rate is exactly the significance level α you chose.
Type II error (β): failing to reject a false null. A false negative — there was a real effect and you missed it.
Power = 1 − β: the probability of correctly rejecting a false null. The chance your test catches a real effect when one exists.
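Written as conditional probabilities, the three definitions above look like this (a restatement only, with no new assumptions):

```latex
\begin{aligned}
\alpha &= P(\text{reject } H_0 \mid H_0 \text{ is true}) \\
\beta &= P(\text{fail to reject } H_0 \mid H_0 \text{ is false}) \\
\text{power} &= P(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta
\end{aligned}
```

Each quantity is conditioned on the truth about H₀, which is why none of them is "the probability that H₀ is true."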

A mnemonic for which is which

Type I comes first, and it's the error of being too eager — crying wolf when there's no wolf (false positive). Type II is the error of being too timid — missing the wolf that's really there (false negative). The "boy who cried wolf" story has both: a Type I error early (false alarm) and a fatal Type II error at the end (real wolf, ignored).

The α/β tradeoff: you can't shrink both for free

Here's the tension. α is the bar for "convincing." Make the bar stricter (smaller α) and you make fewer false alarms, but you also reject less often overall, so you miss more real effects (β goes up). Loosen it (larger α) and the reverse happens. With a fixed sample, pushing one error rate down pushes the other up.
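To put rough numbers on the tradeoff, here is a minimal sketch using a normal (z-test) approximation rather than the exact t-test. The function name `approx_power` and the setup (true effect 0.5, standard deviation 1, 30 per group) are illustrative assumptions, not values fixed by this page.

```python
from statistics import NormalDist

def approx_power(effect, n_per_group, alpha, sd=1.0):
    """Approximate power of a two-sided, two-sample z-test (known, equal sd)."""
    z = NormalDist()
    se = sd * (2.0 / n_per_group) ** 0.5   # standard error of the difference in means
    shift = effect / se                     # true effect measured in standard errors
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)   # rejection threshold implied by alpha
    # Probability the z-statistic lands beyond either threshold.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Hypothetical fixed sample: the only thing that changes is alpha.
for alpha in (0.10, 0.05, 0.01):
    power = approx_power(effect=0.5, n_per_group=30, alpha=alpha)
    print(f"alpha={alpha:.2f}  beta={1 - power:.2f}  power={power:.2f}")
```

As α gets stricter, the printed β rises: same data, fewer false alarms, more misses.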

This is why the "right" α depends on which mistake is costlier. A spam filter that flags a real email as spam (false positive) can bury a message you needed; one that lets spam through (false negative) is only mildly annoying, so you might tolerate more false negatives. But a smoke detector should scream at the faintest whiff: a false alarm is cheap, a missed fire is catastrophic, so you accept many Type I errors to crush Type II.

Choosing alpha is a values decision, not a math fact

There is nothing sacred about 0.05. Pick α by asking: in my problem, how bad is a false positive compared to a false negative? When false alarms are expensive (a costly drug rollout), use a stricter α. When missing a real effect is the disaster (early disease screening), loosen α or — better — buy more power with a bigger sample.

The only way to shrink both: more data

The escape hatch from the tradeoff is sample size. With a fixed n you trade α against β. But collect more data and you can lower both: a bigger sample sharpens the test's ability to tell signal from noise, so the same α buys you more power. Power has four levers: the true effect size, the sample size n, the significance level α, and the variance of the measurements.

Three of these you often can't control: the true effect size is whatever nature made it, α is usually pinned by convention, and variance is limited by how clean your measurement is. The one lever firmly in your hands is n. That's why "how many samples do I need?" is the central question of experiment design.
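One way to answer "how many samples do I need?" is the standard normal-approximation formula for comparing two means, n per group ≈ 2·((z₁₋α/₂ + z_power)·σ / effect)². The sketch below assumes equal group sizes, a common standard deviation, and a two-sided test; the helper name `n_per_group_needed` is made up for illustration, and an exact t-test calculation gives slightly larger values.

```python
from math import ceil
from statistics import NormalDist

def n_per_group_needed(effect, alpha=0.05, power=0.80, sd=1.0):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # threshold for the chosen alpha
    z_power = z.inv_cdf(power)              # extra margin needed to reach the target power
    return ceil(2.0 * ((z_alpha + z_power) * sd / effect) ** 2)

# Hypothetical effect sizes, in units of the standard deviation.
for effect in (0.8, 0.5, 0.2):
    print(f"effect={effect}: about {n_per_group_needed(effect)} per group")
```

Because the effect is squared in the denominator, halving the effect you want to detect roughly quadruples the sample you need.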

Simulating power: just run the test many times and count

Power sounds abstract until you compute it, and the simulation recipe is beautifully simple:

Build a world where a real effect exists (you set its size).
Draw a sample and run the test. Did it reject H₀? Yes or no.
Repeat thousands of times.
Power = the fraction of repeats that rejected — the rate at which your test catches the effect you planted.
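The four steps above can be sketched with NumPy and SciPy. The function name `simulate_power`, the seed, and the number of repeats are illustrative choices; the effect of 0.5 and 30 per group match the scenario discussed next.

```python
import numpy as np
from scipy import stats

def simulate_power(effect, n_per_group, alpha=0.05, n_experiments=5000, seed=0):
    """Estimate power as the fraction of simulated studies that reject H0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_experiments):
        # Step 1: a world where the effect is real (treatment mean shifted by `effect`).
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(effect, 1.0, n_per_group)
        # Step 2: run the test and record whether it rejects.
        p_value = stats.ttest_ind(treatment, control).pvalue
        if p_value <= alpha:
            rejections += 1
    # Step 4: power is the rejection rate across all repeats (step 3).
    return rejections / n_experiments

power = simulate_power(effect=0.5, n_per_group=30)
print(f"estimated power: {power:.3f}")
```

The estimate wobbles a little from seed to seed; more repeats tighten it.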

With an effect of 0.5 and only 30 per group, power lands somewhere around 0.5 — meaning a coin flip whether you'd detect a real effect of that size. Half the time you'd run this experiment, find nothing significant, and (if you misread it) conclude the effect doesn't exist. It does; your test just wasn't strong enough.

The most dangerous misreading on this page

A non-significant result from an underpowered study does not mean "no effect." If power is 0.4, then even when the effect is unmistakably real, you'll fail to reject H₀ 60% of the time. "We found no significant difference" with low power is almost uninformative — the effect could easily be there, undetected. Always ask "what was my power?" before reading a null result as evidence of absence.

Power rises with sample size and with effect size

Two of the four levers are the ones you reason about most. Let's watch power climb as we increase n, and separately as we increase the true effect. We'll draw both curves with Plotly.
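Here is a sketch of the numbers behind the power-versus-n curve, printed as a table rather than plotted so it runs without a plotting library. The helper `rejection_rate`, the grid of sample sizes, and the fixed effect of 0.5 are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def rejection_rate(effect, n_per_group, rng, alpha=0.05, n_experiments=4000):
    """Simulate many studies at once (one per row) and return the share that reject."""
    control = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
    treatment = rng.normal(effect, 1.0, size=(n_experiments, n_per_group))
    p_values = stats.ttest_ind(treatment, control, axis=1).pvalue  # one p-value per row
    return float(np.mean(p_values <= alpha))

rng = np.random.default_rng(42)
sample_sizes = [10, 20, 30, 50, 80, 120, 200]
power_by_n = [rejection_rate(0.5, n, rng) for n in sample_sizes]

for n, p in zip(sample_sizes, power_by_n):
    note = "  <- at or above 0.80" if p >= 0.80 else ""
    print(f"n={n:4d}  power={p:.2f}{note}")
```

Feeding `sample_sizes` and `power_by_n` to any plotting library gives the curve.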

The curve climbs toward 1.0 as n grows: more data, more power. The green line marks 0.80, the conventional minimum power people design for — you want at least an 80% chance of catching a real effect before you bother running the study. Now hold n fixed and grow the effect instead:
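A matching sketch for the effect-size sweep, with n held at a hypothetical 30 per group. The grid of effects and the seed are illustrative assumptions; the first grid point is an effect of zero.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
n_per_group, alpha, n_experiments = 30, 0.05, 4000
effects = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

power_by_effect = []
for effect in effects:
    # Each row is one simulated study with the same n but a different true effect.
    control = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
    treatment = rng.normal(effect, 1.0, size=(n_experiments, n_per_group))
    p_values = stats.ttest_ind(treatment, control, axis=1).pvalue
    power_by_effect.append(float(np.mean(p_values <= alpha)))

for effect, p in zip(effects, power_by_effect):
    print(f"effect={effect:.1f}  rejection rate={p:.2f}")
```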

The two punchlines: big effects are easy to detect (high power even at modest n), while tiny effects either get lots of data or slip through. And the leftmost point of the second chart is a gift: when the effect is zero, H₀ is true, so "rejecting" is a Type I error. That point sits near 0.05, which is α. Let's verify that directly.

The Type I error rate really is α

A well-behaved test, run when H₀ is true, should reject about α of the time — no more, no less. Let's confirm by simulating a world with no effect at all and counting false positives.
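A minimal sketch of that check: both groups come from the same normal(0, 1) distribution, so every rejection is a false positive. The sample size of 30 per group, the seed, and the variable names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
n_per_group, alpha, n_experiments = 30, 0.05, 10000

# No effect anywhere: both groups share the same mean and standard deviation.
group_a = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
group_b = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))

p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue  # one test per row
false_positive_rate = float(np.mean(p_values <= alpha))

print(f"false-positive rate: {false_positive_rate:.3f} (alpha = {alpha})")
```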

The false-positive rate lands right around 0.05. That is the literal meaning of α: the rate at which you'll cry wolf when there's no wolf. It is not "the probability H₀ is true" — it's a property of your decision rule, fixed in advance, that holds whenever H₀ happens to be true.

α is NOT the probability that H₀ is true

α = 0.05 means: if H₀ is true, I will wrongly reject it 5% of the time. It is a long-run error rate of your procedure, not a statement about any particular hypothesis. The probability that H₀ is true isn't something a single test can give you at all — same trap as misreading the p-value in P-values.

Question (select one)

In the simulation above, both groups were drawn from the same distribution, yet the test rejected $H_0$ about 5% of the time. What does that 5% represent?

The power of the test

The Type I error rate, which equals the chosen $\alpha$

The probability that $H_0$ is true

A sign the test is broken

Why underpowered studies are doubly dangerous

You might think an underpowered study is merely useless — it often fails to find real effects. It's worse than that. Underpowered studies don't just miss effects; the "significant" results they do produce are inflated — systematically too big. This is the winner's curse of low power.

The logic: when power is low, a true effect only clears the significance bar on the lucky runs where noise happened to exaggerate it. The unlucky runs (where noise shrank the estimate) stay non-significant and get filed away. So the published, significant estimates are a biased sample — the overshoots. Let's see it.
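A sketch of that selection effect, assuming a true effect of 0.3 and a deliberately small, hypothetical sample of 20 per group. It compares the average estimate across all simulated studies with the average among only the significant ones.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_effect, n_per_group, alpha, n_experiments = 0.3, 20, 0.05, 10000

control = rng.normal(0.0, 1.0, size=(n_experiments, n_per_group))
treatment = rng.normal(true_effect, 1.0, size=(n_experiments, n_per_group))

# Each study's effect estimate is its difference in sample means.
estimates = treatment.mean(axis=1) - control.mean(axis=1)
p_values = stats.ttest_ind(treatment, control, axis=1).pvalue
significant = p_values <= alpha  # includes the rare significant result in the wrong direction

print(f"share significant (power):  {significant.mean():.2f}")
print(f"mean of all estimates:      {estimates.mean():.2f}")
print(f"mean of significant only:   {estimates[significant].mean():.2f}")
```

With n this small, an estimate has to overshoot the truth by a wide margin just to reach significance, so the significant subset is made up almost entirely of overshoots.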

The average of all estimates sits near the true 0.3, so the estimator is unbiased overall. But the average of just the significant ones is much larger. In a world where only significant results get published or believed, low-powered studies systematically overstate effects. This, along with p-hacking (covered in P-values), is a leading driver of the replication crisis: splashy findings that shrink or vanish when someone repeats them with adequate power.

Power matters even when p < 0.05

A tempting misconception: "I got significance, so power is irrelevant now." Wrong. Power governs whether your significant result is trustworthy and replicable. A p < 0.05 from a badly underpowered study is exactly the kind of result that fails to replicate and whose effect size is inflated. Power is not just about avoiding false negatives — it's about whether your positives can be believed.

Challenge 1 — Estimate power by simulation

A real effect exists: the treatment mean exceeds the control mean by effect, and both groups have standard deviation 1. Estimate this test's power at n_per_group and alpha by simulation.

Run n_experiments simulated studies. In each, draw control from normal(0, 1, n_per_group) and treatment from normal(effect, 1, n_per_group), then run a two-sample t-test.
Count how often the test rejects $H_0$ (p-value <= alpha).
Store the rejection rate (rejections / n_experiments) as a float called power.

Use the provided rng for all randomness. Power is just the fraction of experiments that reach significance when the effect is real.

Challenge 2 — Confirm the Type I error rate

Now there is no real effect at all — both groups come from normal(0, 1, n_per_group). Any rejection is a Type I error. Estimate that error rate.

Run n_experiments simulated studies where both groups are drawn from normal(0, 1, n_per_group).
Count how often the two-sample t-test rejects $H_0$ (p-value <= alpha).
Store the rejection rate as a float called type_I_rate.

Because $H_0$ is true here, this rate should come out close to alpha (0.05).

Putting it together: the full picture

Notice that power analysis happens before you collect data, not after. You decide the smallest effect worth detecting, then compute the n that gives you an 80% (or higher) chance of catching it. Running first and asking about power later is how underpowered studies happen.

Check your understanding

Question (select one)

A new drug truly works, but a small trial fails to reach significance and concludes "no effect." Which error was made, and what is its rate called?

A Type I error, with rate $\alpha$

A Type II error, with rate $\beta$

A false positive, with rate $1 - \beta$

No error — non-significant means there is no effect

Question (select one)

Statistical power is best defined as:

The probability of making a Type I error

The probability that $H_0$ is false

The probability of correctly rejecting $H_0$ when $H_0$ is actually false (that is, $1 - \beta$)

1 minus the p-value

Question (select one)

Which change will decrease the power of a two-sample t-test, all else equal?

Increasing the sample size

Studying a larger true effect

Reducing $\alpha$ from 0.05 to 0.01

Reducing the variability (noise) in the measurements

Question (select one)

Why is reporting power especially important after a non-significant result?

Because a non-significant result is always wrong

Because with low power, a real effect is frequently missed, so "not significant" may mean "undetected" rather than "absent"

Because power changes the p-value after the fact

Because high power guarantees the null is true

Question (select one)

A team brags that they got p < 0.05 from a tiny pilot study, and says power "doesn't matter now that it's significant." What is the flaw?

They are right; once significant, power is irrelevant

Low power makes a significant result less trustworthy and inflates the apparent effect size, so it often fails to replicate

Power only matters for Bayesian analyses

A significant p-value from a small sample is impossible

Question (select one)

You want at least an 80% chance of detecting a difference in conversion of 1 percentage point, if it exists. What should you do before launching the A/B test?

Launch immediately and check power afterward if the result is null

Run a power analysis to find the sample size that yields power >= 0.80 for that effect size, then collect that much data

Pick whatever sample size is convenient; power takes care of itself

Lower $\alpha$ to 0.001 to be safe

Key takeaways

What to carry forward

A test can err two ways: Type I (false positive, reject a true H₀, rate α) and Type II (false negative, miss a real effect, rate β).
Power = 1 − β is the chance of detecting a real effect. Design for at least 0.80.
With a fixed sample, α and β trade off; the way to shrink both is a larger n. Power grows with a larger effect size, a larger n, a larger α, and a smaller variance.
α is a false-positive rate, not the probability H₀ is true. Power still matters when p < 0.05 — it governs trust and replication.
A non-significant result from a low-power study means undetected, not absent; and low power inflates the effects it does detect (the winner's curse), feeding the replication crisis.
Plan power before collecting data; interpret results with P-values, Effect Sizes, and the sampling logic from Sampling Distributions.
