P-values

The single most misunderstood number in statistics — what a p-value actually measures, built visually as a tail area under the null, and the long list of things it is not.

The p-value is the most quoted and most misunderstood number in all of applied statistics. People treat it as "the probability the result is real," "the probability we're wrong," or "the chance it was a fluke." All of those are wrong, and believing them leads to bad decisions worth real money and, in medicine, real lives.

This page does two things. First, it builds the p-value from scratch, by simulation, so you can literally see what it measures — an area under a curve. Second, it relentlessly clears away every classic misreading, because knowing what a p-value is not is just as important as knowing what it is.

The one-sentence definition

A p-value is the probability of observing data at least as extreme as what you saw, assuming the null hypothesis H₀ is true.

Read it slowly. Three pieces carry all the meaning:

"assuming H₀ is true": the whole calculation lives in a hypothetical world where there is no effect. The p-value never asks whether that world is the real one.
"at least as extreme": not the probability of your exact result (which is exactly zero for continuous data and tiny for most discrete data), but the probability of your result or anything more surprising.
"data": summarized by a test statistic. We measure extremeness on that one number.

Everything else on this page is a consequence of, or a contrast with, that sentence.

It's a conditional probability, and the condition matters

In symbols, a p-value is P(data this extreme or more | H₀). The thing to the right of the bar — "given H₀" — is assumed, not concluded. This is exactly why a p-value can never be the probability that H₀ is true: you cannot end up with a probability about the thing you assumed at the start.
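Written in terms of a test statistic, the same definition might look like the lines below. The symbols are notation introduced here for illustration, not taken from the page: T is the test statistic as a random variable, t_obs is the value you observed, and c is the center of the null distribution. The two-sided form assumes the null distribution is symmetric about c.

```latex
p_{\text{right}} = P\left(T \ge t_{\text{obs}} \mid H_0\right),
\qquad
p_{\text{two-sided}} = P\left(|T - c| \ge |t_{\text{obs}} - c| \mid H_0\right)
```

In both forms H₀ sits to the right of the bar: it is the condition, never the quantity being measured.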

Build it yourself: the p-value as a tail area

Here is the idea that makes p-values click. Suppose H₀ is true. Then there is a whole distribution of test-statistic values you might get, just from random sampling — the null distribution. Your observed statistic is one point on that distribution. The p-value is simply how much of the null distribution is at least as extreme as your point — the shaded tail.

Let's make that picture real. We'll simulate the null distribution for a coin-flip example, then read the p-value straight off it. A friend claims their coin is fair. You flip it 100 times and get 62 heads. Under H₀ ("the coin is fair, p = 0.5"), how surprising is 62-or-more-extreme?
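A minimal sketch of that simulation, assuming NumPy is available. The number of simulated experiments, the seed, and the variable names are illustrative choices, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

n_flips = 100
observed_heads = 62
n_sims = 100_000  # assumed number of simulated fair-coin experiments

# Null distribution: head counts from 100 flips of a fair coin, repeated many times
null_heads = rng.binomial(n=n_flips, p=0.5, size=n_sims)

# "At least as extreme" = at least as far from 50 as the observed 62, in either direction
expected_heads = n_flips * 0.5
observed_distance = abs(observed_heads - expected_heads)
as_extreme = np.abs(null_heads - expected_heads) >= observed_distance

# The p-value is the share of simulated experiments that land in the two tails
p_value_sim = as_extreme.mean()
print(f"Simulated two-sided p-value: {p_value_sim:.4f}")
```

With this many simulations the result should land close to 0.02, wobbling slightly from seed to seed.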

The p-value you get is the proportion of fair-coin simulations that were as far from 50 as your real result. That's all it is: a tail area under the null distribution. Nothing in that calculation knows or claims whether the coin is actually fair — we assumed it was fair to draw the distribution in the first place.

Why 'two-sided' counts both tails here

Because the claim was just "the coin is biased" (either way), a result of 38 heads would be exactly as surprising as 62 heads — both sit 12 away from 50. So "at least as extreme" includes both tails. If the claim had been specifically "biased toward heads," you'd count only the upper tail and the p-value would be about half as large. This is the one-sided vs two-sided choice from Hypothesis Testing, now visible as which tails you shade.
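The tail choice can be made concrete with exact binomial tail areas. This is a small sketch using scipy.stats.binom; the variable names are illustrative.

```python
from scipy import stats

n_flips, observed_heads = 100, 62
null = stats.binom(n=n_flips, p=0.5)  # distribution of head counts under H0

# Upper tail: P(X >= 62). sf(k) is P(X > k), so pass k - 1.
p_upper = null.sf(observed_heads - 1)

# Mirror-image lower tail: P(X <= 38)
p_lower = null.cdf(n_flips - observed_heads)

# Two-sided: shade both tails. One-sided ("biased toward heads"): upper tail only.
p_two_sided = p_upper + p_lower
p_one_sided = p_upper

print(f"one-sided: {p_one_sided:.4f}, two-sided: {p_two_sided:.4f}")
```

Because a fair coin's null distribution is symmetric around 50, the two tails are equal and the one-sided p-value is exactly half the two-sided one here.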

Does our hand-built p-value match the textbook one?

A simulated tail area should agree with what scipy.stats computes analytically. Let's check against an exact binomial test.
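One way to run that check, assuming a SciPy version that provides scipy.stats.binomtest (1.7 or later). The simulation size and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
n_flips, observed_heads = 100, 62

# Hand-built p-value: share of fair-coin simulations at least 12 away from 50
null_heads = rng.binomial(n_flips, 0.5, size=200_000)
p_sim = np.mean(np.abs(null_heads - 50) >= abs(observed_heads - 50))

# Textbook p-value: exact two-sided binomial test
p_exact = stats.binomtest(
    observed_heads, n=n_flips, p=0.5, alternative="two-sided"
).pvalue

print(f"simulated: {p_sim:.4f}")
print(f"exact:     {p_exact:.4f}")
```

The two numbers should agree to roughly two or three decimal places; the remaining gap is simulation noise, which shrinks as the number of simulated experiments grows.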

Building the p-value as a tail area, visually

The same idea works for continuous statistics. Below we simulate the null distribution of a difference in means (two groups with no real difference), then shade everything at least as extreme as one observed gap. The vertical lines are the observed difference and its mirror image; the p-value is the shaded probability beyond them.
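Here is a hedged sketch of that simulation without the plotting. Everything numeric is an assumption made for illustration: 30 observations per group, standard-normal data under H₀, and a hypothetical observed difference of 0.6.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n_per_group = 30      # assumed group size
n_sims = 20_000       # assumed number of simulated null experiments
observed_diff = 0.6   # hypothetical observed gap between group means

# Under H0 both groups come from the same distribution (standard normal assumed)
group_a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
group_b = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
null_diffs = group_a.mean(axis=1) - group_b.mean(axis=1)

# Shade everything at least as far from zero as the observed gap or its mirror image
shaded = np.abs(null_diffs) >= abs(observed_diff)
p_value = shaded.mean()

print(f"center of null distribution: {null_diffs.mean():+.4f}")
print(f"two-sided p-value: {p_value:.4f}")
```

Passing null_diffs to a histogram function such as matplotlib's plt.hist, with vertical lines at ±observed_diff, would draw the picture; plotting is left out to keep the sketch light.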

Look at the histogram. It is centered on zero — because under H₀ the two groups are identical, so most random gaps are near nothing. Your observed difference sits out in the tail. The further out it sits, the thinner the tail beyond it, and the smaller the p-value. A small p-value just means "the null distribution rarely reaches this far."

Question (select one)

In the histogram above, the null distribution of the difference in means is centered on zero. Why?

Because the observed difference happened to be small

Because under $H_0$ the two groups come from the same process, so most random differences are near zero

Because differences in means are always zero on average in any dataset

Because we subtracted the mean from the data first

What a p-value is NOT

This is the heart of the page. Every item below is a misreading you will hear from smart people. Burn the corrections in.

Let's take the four most damaging ones in turn.

It is NOT the probability that H₀ is true

This is the big one. A p-value is computed assuming H₀. You cannot feed "H₀ is true" into a calculation and have its probability fall out the other end. P(data | H₀) and P(H₀ | data) are different quantities, and confusing them is called the prosecutor's fallacy.

A medical-screening example makes the gap vivid. The probability of a positive test given you're healthy is small. The probability you're healthy given a positive test can still be large, if the disease is rare. Flipping the condition changes the answer entirely.
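The arithmetic can be sketched with Bayes' rule. The 5% false-positive rate is the figure used in this example; the 1% prevalence and 90% sensitivity are assumptions chosen only to make the numbers concrete.

```python
prevalence = 0.01           # assumed: 1% of people have the disease
sensitivity = 0.90          # assumed: P(positive | sick)
false_positive_rate = 0.05  # P(positive | healthy)

# Total probability of a positive test, over sick and healthy people
p_positive = (
    sensitivity * prevalence
    + false_positive_rate * (1 - prevalence)
)

# Bayes' rule: flip the condition from P(positive | healthy) to P(healthy | positive)
p_healthy_given_positive = false_positive_rate * (1 - prevalence) / p_positive

print(f"P(positive | healthy) = {false_positive_rate:.2f}")
print(f"P(healthy | positive) = {p_healthy_given_positive:.3f}")
```

Under these assumed inputs, roughly 85% of positive results belong to healthy people, even though only 5% of healthy people test positive.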

With a false-positive rate of 5%, most positives can still be false alarms when the disease is rare enough. In exactly the same way, a p-value of 0.05 is not a 5% chance that H₀ is true. To get P(H₀ | data) you'd also need the prior plausibility of H₀, which the p-value never uses.

The prosecutor's fallacy, in one line

"The p-value is 0.01, so there's a 99% chance the effect is real" is false. The p-value is P(data | H₀), not P(H₁ | data). Swapping the two is the most expensive mistake in applied statistics.

It is NOT "the probability the result happened by chance"

This one sounds reasonable and is subtly circular. The p-value is computed in a world where the result did happen purely by chance — that's the assumption. So it can't also be the probability that chance was at work; that's baked in. What the p-value measures is: given chance is the only thing operating, how often does chance reach this far?

It is NOT 1 − P(alternative), nor the effect size

A p-value says nothing about how probable H₁ is, and nothing about how big an effect is. To see the last point, keep the true effect identical and tiny, grow only the sample, and the p-value collapses toward zero. Same effect, wildly different p-values.
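A deterministic sketch of that collapse. To strip out sampling noise, it assumes the observed difference equals the true effect exactly and uses a two-sample z-test with unit standard deviation in each group. The effect size of 0.05 standard deviations and the list of sample sizes are illustrative assumptions.

```python
import numpy as np
from scipy import stats

effect = 0.05  # assumed true difference in means, in standard-deviation units
sample_sizes = [50, 500, 5_000, 50_000, 500_000]  # observations per group

p_values = []
for n in sample_sizes:
    se = np.sqrt(2.0 / n)         # standard error of the difference in means
    z = effect / se               # same effect, shrinking standard error
    p = 2 * stats.norm.sf(z)      # two-sided p-value under the null
    p_values.append(p)
    print(f"n per group = {n:>7,}  z = {z:6.2f}  p = {p:.3g}")
```

Only n changes from row to row, so any movement in p reflects the amount of data, not the size of the effect.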

The effect never changed, yet the p-value swung from "not significant" to "astronomically significant" purely by collecting more data. If a number can do that while the underlying truth is fixed, it cannot be telling you how big the effect is. For size, you need an effect size and a confidence interval — the subjects of Effect Sizes and Confidence Intervals.

A non-significant p does NOT prove "no effect"

The mirror image of the above. A large p-value (say 0.40) means your data was unremarkable under H₀ — but as the example just showed, a real effect can easily produce a large p-value when n is small. "Not significant" means undetected, not absent. This is the same not-guilty-is-not-innocent point from Hypothesis Testing, and we'll quantify exactly when real effects go undetected in Errors and Power.

Question (select one)

A study reports p = 0.03. A colleague says: "So there's a 97% chance the effect is real." What is wrong with that?

Nothing — that is a fine way to read a p-value

The percentage should be 3%, not 97%

A p-value is $P(\text{data} \mid H_0)$, not $P(\text{effect is real} \mid \text{data})$; the two require different information and are not interchangeable

The colleague forgot to multiply by the sample size

How to read a p-value, in practice

Given all the ways to misread it, here is a clean decision flow that stays honest.

Notice that the p-value is never the end of the analysis. After a significant result you ask "how big?"; after a non-significant one you ask "did I have the power to see it?". The p-value opens the conversation; it does not close it.

The danger of looking many times: p-hacking

Here is the practical reason all of this matters. If H₀ is true and the test statistic is continuous, a p-value is uniformly distributed between 0 and 1. So if you test 20 independent things that are all truly null, you expect about one of them, on average, to come up "significant" at α = 0.05 by pure luck. Hunt through enough comparisons and you will almost certainly find a "discovery."
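A minimal simulation of this, assuming two-sample t-tests on normally distributed data with 50 observations per group. Those choices, the seed, and the 2,000 repeated batches are all illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
n_batches, n_tests, n_per_group, alpha = 2_000, 20, 50, 0.05

# Every comparison is truly null: both groups come from the same distribution
a = rng.normal(size=(n_batches * n_tests, n_per_group))
b = rng.normal(size=(n_batches * n_tests, n_per_group))

# One t-test per row, then regroup the p-values into batches of 20 tests
p_values = stats.ttest_ind(a, b, axis=1).pvalue.reshape(n_batches, n_tests)
false_positives = (p_values < alpha).sum(axis=1)

mean_false_positives = false_positives.mean()
share_with_any = (false_positives > 0).mean()

print("First batch of 20 tests:", false_positives[0], "'significant' results")
print(f"Average per batch: {mean_false_positives:.2f}")
print(f"Batches with at least one: {share_with_any:.1%}")
```

The average should sit near 1 per batch, and the share of batches with at least one false positive near 1 − 0.95²⁰ ≈ 64%.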

Simulate twenty truly null tests and you'll often see one or two "significant" results (and about a third of the time none), all of them false by construction. This is the multiple-comparisons problem, and exploiting it, deliberately or not, is p-hacking: test enough things, or peek repeatedly as data trickles in, and false positives become nearly inevitable. The fix is to plan your comparisons in advance and adjust for how many you make. We treat the cures in Statistical Fallacies.

Why this breaks naive A/B testing

The same trap appears when you "peek" at a running A/B test and stop the moment p dips below 0.05. Each peek is another chance for noise to cross the line, so your real false-positive rate climbs far above 5%. A p-value is only trustworthy when the number of looks and tests was fixed in advance.
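The inflation can be sketched by simulating A/B tests with no true difference and comparing two policies on the same data: stop at the first peek where p < 0.05, or test once at the planned end. The peek schedule (every 50 observations per arm, up to 500), the normal data, and the seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)  # arbitrary seed
n_experiments, max_n, alpha = 2_000, 500, 0.05
peek_points = range(50, max_n + 1, 50)  # assumed schedule: 10 looks per experiment

# No true effect: both arms are drawn from the same distribution
a = rng.normal(size=(n_experiments, max_n))
b = rng.normal(size=(n_experiments, max_n))

# Peeking policy: declare a win if p < alpha at ANY look
declared_win = np.zeros(n_experiments, dtype=bool)
for n in peek_points:
    p_at_peek = stats.ttest_ind(a[:, :n], b[:, :n], axis=1).pvalue
    declared_win |= p_at_peek < alpha

# Fixed-horizon policy: a single test once all the data is in
p_final = stats.ttest_ind(a, b, axis=1).pvalue

rate_peeking = declared_win.mean()
rate_fixed = (p_final < alpha).mean()
print(f"False-positive rate, fixed horizon: {rate_fixed:.1%}")
print(f"False-positive rate, with peeking:  {rate_peeking:.1%}")
```

The fixed-horizon rate should stay near the nominal 5%, while the peeking rate comes out several times higher, and it keeps growing as more looks are added.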

Challenge 1 — Compute a p-value from a null distribution

You ran a one-sided test. The setup gives you null_stats — an array of test-statistic values simulated under $H_0$ — and a single observed_stat. Your alternative is "the statistic is larger than the null predicts," so more extreme means larger.

Compute the right-tailed p-value as the proportion of null_stats that are greater than or equal to observed_stat.
Store it as a float called p_value.

This is just an "area in the upper tail": count how many simulated values reach at least as far as what you saw, divided by the total.

Challenge 2 — A two-sided p-value

Now your alternative is just "the statistic is different from the null center" (which is 0 here), so more extreme means farther from zero in either direction.

Compute the two-sided p-value: the proportion of null_stats whose absolute value is greater than or equal to the absolute value of observed_stat.
Store it as a float called p_two_sided.

Hint: compare np.abs(null_stats) to abs(observed_stat). The two-sided p-value should be roughly double the one-sided one for a symmetric null.

Check your understanding

Question (select one)

Which statement is the correct definition of a p-value?

The probability that the null hypothesis is true given the data

The probability of obtaining data at least as extreme as observed, assuming the null hypothesis is true

The probability that the observed result was caused by random chance

The probability that the study will replicate

Question (select one)

A p-value of 0.02 was obtained. Which interpretation is acceptable?

There is a 2% chance the result is a fluke

There is a 98% chance the alternative hypothesis is true

If $H_0$ were true, data this extreme or more would occur about 2% of the time

The effect is large because the p-value is small

Question (select one)

Why is a p-value not equal to the probability that $H_0$ is true?

Because $H_0$ is never true in real data

Because p-values are always biased upward

Because the p-value is computed assuming $H_0$, so it is $P(\text{data} \mid H_0)$, and reversing the condition to $P(H_0 \mid \text{data})$ requires extra information it never uses

Because the sample size is usually too small

Question (select one)

You hold the true effect fixed and tiny, then increase the sample size from 50 to 50,000. The p-value drops from 0.30 to 0.0001. What does this demonstrate?

The effect got bigger as you collected more data

The p-value is unreliable and should be ignored

The p-value reflects evidence against $H_0$, which grows with sample size even for a fixed small effect, so a p-value is not a measure of effect size

Larger samples always make effects more important

Question (select one)

A test gives p = 0.45. Which conclusion is the most defensible?

There is definitely no effect

The data did not provide significant evidence against $H_0$; a real effect may still exist but went undetected

The probability the null is true is 0.45

The effect size is 0.45

Question (select one)

A data scientist runs 40 independent A/B comparisons on metrics that truly have no effect, using $\alpha = 0.05$. About how many "significant" results should they expect, and why?

Zero, because none of the effects are real

About 2, because each truly-null test has a 5% chance of a false positive, and 5% of 40 is 2

All 40, because the tests are being misused

Exactly 1, because only one result can be significant at a time

Question (select one)

Which practice inflates the real false-positive rate above the stated $\alpha$?

Fixing the sample size and the hypothesis before collecting data

Reporting the effect size alongside the p-value

Repeatedly peeking at a running experiment and stopping as soon as p dips below 0.05

Using a two-sided test instead of a one-sided test

Key takeaways

What to carry forward

A p-value is P(data this extreme or more | H₀) — a tail area under the null distribution, which you can get by simulation or by formula.
It is not the probability H₀ is true, not the probability the result is chance, not the probability you're wrong, not 1 − P(H₁), not the effect size, and not the probability of replication.
A small p-value says the data is surprising under H₀; it says nothing about how big the effect is — for that, see Effect Sizes and Confidence Intervals.
A large p-value means undetected, not absent — a question of power, covered in Errors and Power.
Testing many things or peeking repeatedly manufactures false positives (p-hacking); plan and adjust comparisons in advance, as in Statistical Fallacies.
