Hypothesis Testing
The logic of testing a claim under uncertainty — assume the skeptical null, measure how surprising your data would be if it were true, and decide whether to reject it.
You ran an experiment. The new checkout flow converted at 11.4%, the old one at 10.1%. The new flow looks better. Should you ship it?
Here is the trap. You already saw in Why Statistics Matters that two groups drawn from the exact same process almost never produce the exact same average. Random noise manufactures differences for free. So "11.4% beats 10.1%" is not, by itself, evidence of anything. The real question is the one this whole page is about:
If there were truly no difference, how often would noise alone hand me a gap this big or bigger?
Hypothesis testing is the disciplined procedure for answering that. It is the engine underneath A/B tests, clinical trials, quality control, and almost every "is this real?" decision a data scientist makes.
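To make that question concrete, here is a minimal simulation sketch. The page gives only the two conversion rates, so the 5,000 visitors per flow and the use of the averaged rate as the "no difference" rate are illustrative assumptions, not figures from the experiment.

```python
import numpy as np

rng = np.random.default_rng(42)

n_per_arm = 5_000                     # ASSUMPTION: visitors per flow (not given in the text)
observed_gap = 0.114 - 0.101          # new flow minus old flow
pooled_rate = (0.114 + 0.101) / 2     # ASSUMPTION: shared rate if there is truly no difference

n_sims = 20_000
# Simulate both flows from the SAME conversion rate, many times over.
old = rng.binomial(n_per_arm, pooled_rate, size=n_sims) / n_per_arm
new = rng.binomial(n_per_arm, pooled_rate, size=n_sims) / n_per_arm
gaps = new - old

# Share of no-effect worlds where noise alone produces a gap at least this big
# in either direction (small tolerance guards against floating-point ties).
frac_as_big = np.mean(np.abs(gaps) >= observed_gap - 1e-12)

print(f"Observed gap: {observed_gap:.3f}")
print(f"Share of null simulations with a gap this big or bigger: {frac_as_big:.3f}")
```

The answer depends heavily on the assumed sample size: the same gap is far less surprising with a few hundred visitors per flow than with several thousand.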
The core idea: assume nothing is going on, then look for a contradiction
Hypothesis testing flips the burden of proof. Instead of trying to prove your exciting idea is true, you start by assuming the boring explanation and check whether the data is hard to reconcile with it.
The boring explanation has a name: the null hypothesis, written H₀. It is the skeptic's default — "nothing is going on, the difference is just chance, the coin is fair, the drug does nothing." Against it stands the alternative hypothesis, H₁ (sometimes Hₐ) — "there is an effect."
| Hypothesis | What it says | Examples |
|---|---|---|
| H₀ (null) | No effect, no difference, status quo | "The two flows convert equally." "μ = 100." |
| H₁ (alternative) | There is an effect or difference | "The new flow converts better." "μ ≠ 100." |
The whole procedure is a proof by contradiction under uncertainty. We provisionally believe H₀, then ask: if H₀ were true, would data like mine be surprising? If it would be very surprising, we conclude H₀ is probably wrong and reject it. If the data is perfectly ordinary under H₀, we have no grounds to abandon it — we fail to reject it.
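As a small illustration of "would data like mine be surprising if H₀ were true?", the sketch below uses the fair-coin null. The 100 flips and 60 heads are hypothetical numbers chosen for the example; `scipy.stats.binomtest` reports how often a fair coin would produce a result at least this lopsided.

```python
from scipy import stats

n_flips = 100   # hypothetical experiment size
n_heads = 60    # hypothetical outcome

# H0: the coin is fair, P(heads) = 0.5. The test is two-sided by default,
# so results lopsided toward tails count as "surprising" too.
result = stats.binomtest(n_heads, n_flips, p=0.5)

print(f"Chance of a result at least this lopsided under H0: {result.pvalue:.4f}")
```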
Why start from the skeptical hypothesis?
Science and analytics work this way for the same reason courts presume innocence: it is much easier to gather evidence that contradicts a specific claim ("nothing is going on") than to directly prove a vague one ("something is going on"). H₀ is specific enough to compute with — it gives us an exact picture of what "just noise" looks like.
The courtroom analogy
If you remember one mental model from this page, make it this one. A hypothesis test is a trial, and the null hypothesis is the defendant.
The mapping is exact, and the most important entry is the last one:
- Presumption of innocence = we assume H₀ until the data forces us off it.
- Beyond reasonable doubt = the significance level α, the bar the evidence must clear.
- "Guilty" = reject H₀.
- "Not guilty" = fail to reject H₀ — and crucially, a "not guilty" verdict does not declare the defendant innocent. It says the prosecution did not meet the bar. The defendant may well have done it; the evidence just was not strong enough.
The single biggest misconception: 'not guilty' is not 'innocent'
Failing to reject H₀ does not prove H₀ is true. It means your data was consistent with H₀ — but it would often also be consistent with a small real effect you did not have enough data to detect. "We found no significant difference" never means "there is no difference." It means "we could not rule out no difference." We will hammer this again in Errors and Power, where you'll see exactly how a real effect can hide behind a non-significant result.
The five steps of every test
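Whatever the scenario (comparing two means, testing a proportion, checking a correlation), the skeleton is identical: pose the question, state H₀ and H₁, fix α, compute the test statistic and its p-value, then decide whether to reject H₀.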
Let's unpack the two pieces that trip people up.
The test statistic: collapsing your data into one surprising-or-not number
A test statistic is a single number that measures how far your data sits from what H₀ predicts, in units of "ordinary noise." For comparing two means it's the t-statistic, roughly:
t = (observed difference) / (noise in that difference)
A big t (far from zero) means the observed gap dwarfs the typical random wiggle, which is surprising under H₀. A small t means the gap is the kind of thing noise produces all the time. The test statistic is just a ruler for surprise.
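The ratio can be computed by hand and checked against SciPy. This is a sketch on hypothetical simulated data, and it uses the Welch (unequal-variance) form of the standard error as the "noise" term, which is one common choice; the text does not say which variant it has in mind.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: two groups of measurements.
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

# Numerator: the observed difference in means.
observed_difference = group_b.mean() - group_a.mean()

# Denominator: the standard error of that difference (Welch form).
noise = np.sqrt(
    group_a.var(ddof=1) / len(group_a) + group_b.var(ddof=1) / len(group_b)
)
t_manual = observed_difference / noise

# SciPy's Welch t-test computes the same ratio.
t_scipy = stats.ttest_ind(group_b, group_a, equal_var=False).statistic

print(f"manual t = {t_manual:.3f}, scipy t = {t_scipy:.3f}")
```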
The significance level α: the bar, chosen before you look
α is the threshold for "too surprising to be coincidence." By convention it's often 0.05, meaning: if H₀ were true, I am willing to wrongly reject it 5% of the time. You choose α before seeing the data. It encodes how much risk of a false alarm you'll tolerate; it is not a dial you tune to get the answer you want.
Set α before you peek — never after
Choosing α (or your hypotheses) after seeing the data is a form of cheating. If you let the data pick the threshold, you can always find a story that clears the bar. Decide the rules of the game first, then play. We will see in P-values how "looking until it's significant" silently inflates your false-positive rate.
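The cost of peeking can be sketched with a small simulation. Everything here is a hypothetical setup: 1,000 experiments where both groups come from the same distribution (so every rejection is a false alarm), analyzed once at the planned sample size versus checked after every 20 observations and stopped at the first significant look.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 1000                 # hypothetical number of simulated no-effect experiments
max_n = 200                          # hypothetical planned sample size per group
looks = range(20, max_n + 1, 20)     # peek after every 20 observations per group

fixed_rejections = 0
peeking_rejections = 0
for _ in range(n_experiments):
    # Both groups share one distribution, so H0 is true by construction.
    a = rng.normal(0.0, 1.0, max_n)
    b = rng.normal(0.0, 1.0, max_n)

    # Rules fixed in advance: one test, at the planned sample size.
    if stats.ttest_ind(a, b).pvalue <= alpha:
        fixed_rejections += 1

    # Peeking: declare a win at the first look that clears the bar.
    if any(stats.ttest_ind(a[:n], b[:n]).pvalue <= alpha for n in looks):
        peeking_rejections += 1

fixed_rate = fixed_rejections / n_experiments
peeking_rate = peeking_rejections / n_experiments

print(f"False-alarm rate, single planned test: {fixed_rate:.3f}")
print(f"False-alarm rate, stop at first significant peek: {peeking_rate:.3f}")
```

The single planned test should land near α, while the peeking strategy rejects noticeably more often even though nothing is going on in any of these experiments.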
A complete worked example, end to end
A nutrition app claims its users average 8,000 steps a day. You suspect the real average is different. You pull a random sample of 40 users. Let's run the entire procedure.
Step 1 — Question. Is the true average daily step count different from 8,000?
Step 2 — Hypotheses. This is a one-sample setup comparing a mean to a fixed number:
- H₀: μ = 8000 (the claim holds)
- H₁: μ ≠ 8000 (the average is different — two-sided, because "different" could be higher or lower)
Step 3 — Significance level. α = 0.05, fixed now, before we compute anything.
Steps 4 and 5 — Compute and decide:
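A minimal sketch of steps 4 and 5 with `scipy.stats.ttest_1samp`. The step counts are simulated stand-ins, since the real sample is not shown on this page; the mean and spread used to generate them are assumptions, so the printed numbers illustrate the procedure only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# ASSUMPTION: simulated stand-in for the 40 sampled users' daily step counts.
steps = rng.normal(loc=8400, scale=1500, size=40)

alpha = 0.05          # Step 3: fixed before computing anything
claimed_mean = 8000   # H0: mu = 8000

# Step 4: one-sample t-test (two-sided by default, matching H1: mu != 8000).
result = stats.ttest_1samp(steps, popmean=claimed_mean)
t_stat, p_value = result.statistic, result.pvalue

# Step 5: apply the decision rule chosen in advance.
decision = "reject H0" if p_value <= alpha else "fail to reject H0"

print(f"sample mean = {steps.mean():.0f}")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print(f"decision at alpha = {alpha}: {decision}")
```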
Interpretation in plain language. The p-value is the probability of seeing a sample mean at least this far from 8,000 if the true average really were 8,000. If it lands below 0.05, we say: "a gap this large would be unusual under the claim, so we doubt the claim." If it lands above 0.05, we say: "this is an ordinary amount of wiggle; we have no grounds to dispute 8,000." Notice what we never say: we don't prove the average is 8,000, and we don't say how much it differs. The test answers one narrow question — "is the data surprising under H₀?" — and nothing more.
Read the verdict out loud the right way
Practice phrasing decisions like a careful analyst:
- Reject: "There is statistically significant evidence that the average differs from 8,000 (p = 0.03)."
- Fail to reject: "We did not find significant evidence that the average differs from 8,000 (p = 0.21)."
Never: "We proved the average is 8,000." Never: "There is no difference."
Seeing why a difference can be "not significant"
Run the next cell. Two groups are drawn from the identical process (no real effect at all), yet their means differ — and the test correctly fails to reject H₀ most of the time, because the gap is the size of ordinary noise.
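A minimal sketch of such a cell. The distribution parameters and group sizes are arbitrary assumptions, and the generator is left unseeded on purpose so that each run draws fresh data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng()   # unseeded: rerun to see different draws
alpha = 0.05

# Two groups from the SAME process: there is no real effect to find.
group_a = rng.normal(loc=50, scale=10, size=100)
group_b = rng.normal(loc=50, scale=10, size=100)

result = stats.ttest_ind(group_a, group_b)
verdict = "reject H0" if result.pvalue <= alpha else "fail to reject H0"
print(f"mean A = {group_a.mean():.2f}, mean B = {group_b.mean():.2f}")
print(f"p = {result.pvalue:.3f} -> {verdict}")

# Repeat the same no-effect experiment many times and count the false alarms.
n_repeats = 2000
sim_a = rng.normal(loc=50, scale=10, size=(n_repeats, 100))
sim_b = rng.normal(loc=50, scale=10, size=(n_repeats, 100))
p_values = stats.ttest_ind(sim_a, sim_b, axis=1).pvalue
false_alarm_rate = np.mean(p_values <= alpha)
print(f"Share of {n_repeats} no-effect experiments that rejected H0: {false_alarm_rate:.3f}")
```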
You will usually see "fail to reject" — exactly what should happen when H₀ is true. But run it enough and a "reject" sneaks through. That occasional false alarm is the α = 0.05 risk made visible, and it's the subject of Errors and Power.
One-sided vs two-sided tests
Your alternative hypothesis comes in two flavors, and the choice changes where you look for surprise.
- Two-sided (H₁: μ ≠ 8000): you care about a difference in either direction. Surprise lives in both tails. This is the safe default.
- One-sided (H₁: μ > 8000, or μ < 8000): you care about only one direction, decided before seeing data, for a substantive reason ("the drug can only help, or do nothing — it won't hurt").
Don't switch to one-sided just to win
A one-sided test makes it easier to reach significance in the chosen direction, which is exactly why flipping to one-sided after seeing that the data trends your way is a classic form of p-hacking. Pick the side for a scientific reason, in advance, or stay two-sided. scipy.stats tests are two-sided by default; you opt into one-sided with alternative='greater' or alternative='less'.
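The sketch below shows how the `alternative` argument changes the p-value for the same data. The sample is simulated and purely hypothetical. For a t-test, the one-sided p-value in the direction the data points is half the two-sided one, which is why switching sides after the fact is so tempting.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical sample of daily step counts.
steps = rng.normal(loc=8300, scale=1500, size=40)

two_sided = stats.ttest_1samp(steps, popmean=8000)                     # default: 'two-sided'
greater = stats.ttest_1samp(steps, popmean=8000, alternative="greater")
less = stats.ttest_1samp(steps, popmean=8000, alternative="less")

print(f"t statistic (same for all three): {two_sided.statistic:.3f}")
print(f"two-sided p:       {two_sided.pvalue:.4f}")
print(f"one-sided p (>):   {greater.pvalue:.4f}")
print(f"one-sided p (<):   {less.pvalue:.4f}")
```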
A team analyzes their A/B test, sees the treatment did slightly better, and then runs a one-sided test (alternative = "greater") because it gives a smaller p-value than the two-sided test they had planned. Why is this a problem?
One-sided tests are never valid and should not be used
The direction was chosen after seeing the data, which inflates the false-positive rate beyond the stated alpha
Two-sided tests are always more powerful, so switching loses information
The p-value should be multiplied by two when going one-sided
When to use a hypothesis test — and when not to
Hypothesis testing is a precision tool, not a hammer for every nail.
Reach for it when:
- You have a specific yes/no claim to evaluate against a noisy sample ("did the new flow change conversion?").
- The decision is binary — ship or don't, investigate or drop it.
- You can state H₀ and H₁ before collecting data.
Be cautious or look elsewhere when:
- You mainly want to know how big an effect is, not just whether it's nonzero — there, a confidence interval and an effect size tell you far more (see Confidence Intervals and Effect Sizes).
- You're exploring data for patterns with no pre-specified hypothesis — testing dozens of things and keeping the "significant" ones is a recipe for false discoveries (Statistical Fallacies).
- The sample is so large that any trivial difference becomes "significant." Significance and importance are different questions.
'Statistically significant' is not 'large' or 'important'
A test can only tell you whether an effect is distinguishable from zero, not whether it's big enough to matter. With a large enough sample, a lift far too small to affect any business decision can still come out "significant." Rejecting H₀ says "probably not exactly zero"; it says nothing about effect size. Always pair a test with a confidence interval or an effect size to learn how much.
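A quick sketch of how sample size alone can drive significance, computed from summary statistics with `scipy.stats.ttest_ind_from_stats`. All numbers are hypothetical: two groups of 2 million users, a 0.3-second difference in mean time-on-site, and an assumed standard deviation of 60 seconds in both groups.

```python
from scipy import stats

n_per_group = 2_000_000   # hypothetical group size
mean_control = 120.0      # ASSUMPTION: mean time-on-site in seconds
mean_treatment = 120.3    # hypothetical 0.3-second increase
sd = 60.0                 # ASSUMPTION: same spread in both groups

# Two-sample t-test computed from summary statistics only.
result = stats.ttest_ind_from_stats(
    mean_treatment, sd, n_per_group,
    mean_control, sd, n_per_group,
)

# Cohen's d: the difference expressed in standard deviations (an effect size).
cohens_d = (mean_treatment - mean_control) / sd

print(f"t = {result.statistic:.2f}, p = {result.pvalue:.2e}")
print(f"Cohen's d = {cohens_d:.4f}")
```

The p-value is tiny while the standardized effect is a small fraction of one percent of a standard deviation: the test and the effect size are answering different questions.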
Challenge 1 — Make the decision
You are handed a pre-registered significance level and a p-value from a completed test. Apply the decision rule.
- Variables alpha (the significance level) and p_value are provided in the setup.
- Produce a string variable decision that is exactly "reject H0" if the result is statistically significant, and exactly "fail to reject H0" otherwise.
- Use the standard rule: reject when p_value <= alpha.
Remember the convention: when the p-value is on the boundary (equal to alpha), we reject.
Challenge 2 — Run the test yourself
A call center A/B tests a new script. You have handle times (in minutes) for the control and treatment groups. Test whether the two group means differ.
- Use scipy.stats to run a two-sample, two-sided t-test comparing control and treatment.
- Store the test statistic as a float t_stat and the p-value as a float p_value.
- Then set a boolean significant to whether the result is significant at alpha = 0.05 (i.e. p_value <= alpha).
You do not need to compute anything by hand — pick the right scipy.stats function and read off its two outputs.
Common misconceptions, gathered
Five things hypothesis testing does NOT do
- It does not prove H₀. "Fail to reject" means insufficient evidence, not "no effect." (The not-guilty verdict.)
- It does not give you the probability that H₀ is true. The p-value assumes H₀ and measures the data — it is not P(H₀ | data). That's the focus of P-values.
- It does not measure effect size. A tiny p-value can come from a huge sample detecting a microscopic effect.
- Significance is not importance. "Distinguishable from zero" is not "big enough to act on."
- You can't choose α or the hypotheses after peeking. The rules are set before the data is seen.
Check your understanding
What is the null hypothesis in a typical A/B test comparing conversion rates?
The new variant has a higher conversion rate than the control
The two variants have the same conversion rate — any observed difference is just chance
The experiment was run correctly
The sample size is large enough to detect an effect
A test yields p = 0.42 at the pre-set significance level α. Which statement is the most accurate conclusion?
We have proven there is no effect
We failed to reject H₀; the data did not provide significant evidence of an effect, but an effect could still exist undetected
The probability the null is true is 0.42
We should accept H₀ as true
Why do we fix the significance level before collecting data?
Because α must always equal 0.05 by law
Because the p-value cannot be computed otherwise
Because choosing the threshold after seeing the data lets you tune it to get the verdict you want, which destroys the error guarantee
Because a smaller α always gives a better test
A study on 4 million users finds the new homepage increases time-on-site by 0.3 seconds, p < 0.0001. What is the right takeaway?
The effect is enormous because the p-value is so tiny
The result is invalid because the sample is too big
The effect is almost certainly real but likely too small to matter; significance and practical importance are different questions
We should reject the result and gather a smaller sample
In the courtroom analogy, what does "fail to reject H₀" correspond to, and what does it mean?
A "guilty" verdict — the defendant definitely did it
A "not guilty" verdict — the evidence did not meet the bar, which is not the same as declaring innocence
A mistrial — the test could not be run
Proof of innocence: H₀ is established as true
You want to test whether a new fertilizer changes crop yield (it could plausibly help or hurt). Which test setup is appropriate?
A one-sided test, because you hope it helps
A two-sided test with H₁: μ ≠ …
Key takeaways
What to carry forward
- A hypothesis test asks one question: would data like mine be surprising if the skeptical null H₀ were true?
- Assume H₀, compute a test statistic, compare its p-value to a pre-chosen α, then reject or fail to reject.
- Fail to reject is not "accept." Absence of evidence is not evidence of absence (the not-guilty verdict).
- Choose α and your hypotheses before seeing the data; pick two-sided unless a real constraint justifies one-sided.
- A test tells you whether an effect is distinguishable from zero — never how big it is. For magnitude, you need Confidence Intervals and Effect Sizes.
- The mechanics of "how surprising?" are the subject of P-values, and the two ways a test can be wrong are the subject of Errors and Power.