A/B Testing
The capstone — design a real experiment with a clear hypothesis, one primary metric, randomization, and power up front; then analyze it fully with a test, a lift, a confidence interval, and an effect size; and decide to ship or not by weighing significance against practical importance.
This is where the whole course comes together. An A/B test (a randomized controlled experiment) is the single most powerful tool a data scientist has for answering "does this change actually work?" — and it runs on every concept you've built: populations and samples, randomization, sampling distributions, confidence intervals, hypothesis tests, p-values, power, and effect sizes. If you can design and read an A/B test correctly, you can be trusted with real decisions.
We'll go in the order a real experiment does: design it before you collect a single row, analyze it once it's done, and decide whether to ship — then catalog the pitfalls that quietly invalidate otherwise-fine tests.
Design comes first (and is where most tests are won or lost)
The analysis is the easy part. A test that wasn't designed properly can't be rescued by clever statistics afterward. Four decisions, all made before data collection:
1. A single, falsifiable hypothesis. "The new checkout button increases the purchase-conversion rate." Not "the new button is better" (better at what?) — name the exact change and the exact outcome.
2. One primary metric, chosen in advance. Pick the number the decision hinges on — say, conversion rate — and commit to it. Tracking ten metrics and celebrating whichever one moves is the multiple-comparisons trap from Statistical Fallacies wearing a business suit.
3. Randomization. Assign each user to control (A) or treatment (B) at random. This is the load-bearing wall of the whole method, so it gets its own section below.
4. Sample size / power, up front. Decide how many users you need before starting, so the test can actually detect an effect worth caring about. More on this two sections down.
Primary metric is singular for a reason
The moment you have five "primary" metrics, you're running five tests, and the chance that at least one looks significant by luck balloons. Declare one primary metric for the ship decision; everything else is a secondary/guardrail metric you look at for context, not for the verdict.
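To put a number on "balloons", here is a minimal sketch. It assumes the metrics are independent and that each is tested at the same α, which real metrics rarely satisfy exactly, so read the output as a rough guide. The function name `familywise_error_rate` is made up for this illustration.

```python
def familywise_error_rate(n_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_metrics tests.

    Assumes the tests are independent and every null hypothesis is true.
    """
    return 1 - (1 - alpha) ** n_metrics


for k in (1, 5, 10):
    print(f"{k:>2} metrics -> P(at least one false positive) = "
          f"{familywise_error_rate(k):.3f}")
```

With five metrics at α = 0.05, the chance of at least one spurious "win" is already above 20%.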
Why randomization licenses causal claims
Here is the deep reason A/B tests can say "X caused Y" when ordinary observational data (from Statistical Fallacies) cannot. Randomly assigning who gets the treatment makes the two groups comparable in expectation on everything, measured and unmeasured, except the change you're testing. Age, device, loyalty, the weather, mood, and every confounder you never thought of are, on average, balanced across the two arms; any leftover imbalance is pure chance, which is exactly what the test's p-value accounts for. So if B outperforms A by more than chance allows, the only systematic difference between the groups is the treatment. That severs the confounder's arrow and earns you a causal conclusion.
Randomization vs. observation
Without randomization, "users who saw the new button converted more" could just mean the new button shipped to your most engaged users. With randomization, the engaged users are split evenly between A and B, so that explanation is off the table. Randomization is what turns a correlation into a credible cause.
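A small simulation can make the contrast concrete. This is a sketch under invented assumptions: the `engagement` score stands in for a hidden confounder, and the logistic rule for the biased rollout is hypothetical, chosen only so that engaged users are more likely to see the new button.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users = 20_000

# Hypothetical hidden confounder: an engagement score nobody measured.
engagement = rng.normal(loc=0.0, scale=1.0, size=n_users)

# Biased rollout: more engaged users are more likely to get the new button.
p_treated = 1 / (1 + np.exp(-2 * engagement))
biased_b = rng.random(n_users) < p_treated

# Randomized rollout: a fair coin flip per user, independent of engagement.
random_b = rng.random(n_users) < 0.5


def engagement_gap(in_b: np.ndarray) -> float:
    """Mean engagement in arm B minus mean engagement in arm A."""
    return float(engagement[in_b].mean() - engagement[~in_b].mean())


biased_gap = engagement_gap(biased_b)
random_gap = engagement_gap(random_b)
print(f"engagement gap, biased rollout:     {biased_gap:+.3f}")
print(f"engagement gap, randomized rollout: {random_gap:+.3f}")
```

Under the biased rollout the arms differ sharply on engagement before the button does anything; under the coin flip the gap is close to zero, which is what lets the treatment take the credit for any difference in outcomes.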
Power and sample size, decided before you start
You met statistical power in Errors and Power: the probability your test detects a real effect of a given size. Run a test on too few users and even a genuine, worthwhile lift can come back non-significant — you "found nothing" only because you lacked the data to find it. So you size the experiment up front, around the smallest effect worth detecting (the minimum detectable effect, MDE).
The four knobs trade off against each other:
| Knob | Effect on required sample size |
|---|---|
| Smaller effect to detect (MDE) | needs more users |
| Higher power (e.g., 0.9 vs 0.8) | needs more users |
| Smaller α (stricter) | needs more users |
| Noisier metric (bigger variance) | needs more users |
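The table's directions can be checked with a rough sample-size calculation for a conversion metric. The sketch below uses the standard normal-approximation formula for comparing two proportions with equal-sized arms and a two-sided test; the function name `users_per_arm` and the example rates are assumptions for illustration. For a yes/no metric the "noise" knob is set by the baseline rate itself, since the variance is p(1 - p).

```python
from math import ceil, sqrt

from scipy.stats import norm


def users_per_arm(p_base: float, mde_abs: float,
                  alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed in EACH arm to detect an absolute lift.

    Normal approximation, two-sided test, equal arm sizes.
    """
    p_new = p_base + mde_abs
    p_bar = (p_base + p_new) / 2
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the test
    z_power = norm.ppf(power)           # quantile for the desired power
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    ) ** 2
    return ceil(numerator / mde_abs ** 2)


# Hypothetical baseline: 10% conversion, smallest lift worth detecting: 2pp.
print("baseline     :", users_per_arm(0.10, 0.02))
print("smaller MDE  :", users_per_arm(0.10, 0.01))
print("higher power :", users_per_arm(0.10, 0.02, power=0.9))
print("stricter α   :", users_per_arm(0.10, 0.02, alpha=0.01))
```

Halving the MDE roughly quadruples the required sample, which is why agreeing on the smallest effect worth detecting is the most consequential sizing decision.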
Why size up front, not after
If you decide the sample size after peeking at results, you've turned a
clean test into a fishing expedition — you'll be tempted to stop the
moment the numbers look good. Committing to n in advance is what keeps
the false-positive rate at the α you intended.
Analyze: simulate and fully read one experiment
Let's run a complete analysis. We'll simulate an experiment with a known true effect (so we can check our work), then analyze it the way you would real data: compute each arm's conversion rate, the observed lift, a confidence interval on the difference, a formal test, the p-value, and an effect size.
For a conversion (yes/no) metric, the natural test compares two proportions. We'll use a two-proportion z-test computed directly, and cross-check it against a chi-square test on the 2×2 table — they answer the same question.
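One way the simulation and analysis might look is sketched below. The arm sizes, true conversion rates, and seed are invented for illustration. The z-test uses the pooled standard error under H0 and the confidence interval uses the unpooled standard error. Cohen's h is used here as the effect size for two proportions; that is one common choice, not the only one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical ground truth, known only because we are simulating.
n_a, n_b = 10_000, 10_000
true_p_a, true_p_b = 0.10, 0.12

conv_a = rng.binomial(n_a, true_p_a)   # conversions in control
conv_b = rng.binomial(n_b, true_p_b)   # conversions in treatment

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a                       # absolute lift
rel_lift = diff / p_a                  # relative lift

# Two-proportion z-test with the pooled standard error under H0.
p_pool = (conv_a + conv_b) / (n_a + n_b)
se_pool = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = diff / se_pool
p_value = 2 * stats.norm.sf(abs(z))    # two-sided

# 95% CI on the difference, using the unpooled standard error.
se_unpooled = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z_crit = stats.norm.ppf(0.975)
ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

# Cross-check: chi-square on the 2x2 table, without continuity correction,
# gives chi2 = z**2 and the same p-value.
table = np.array([[conv_a, n_a - conv_a],
                  [conv_b, n_b - conv_b]])
res = stats.chi2_contingency(table, correction=False)
chi2, p_chi2 = res[0], res[1]

# Effect size for two proportions (Cohen's h).
cohens_h = 2 * np.arcsin(np.sqrt(p_b)) - 2 * np.arcsin(np.sqrt(p_a))

print(f"rate A = {p_a:.4f}, rate B = {p_b:.4f}")
print(f"lift   = {diff:+.4f} absolute, {rel_lift:+.1%} relative")
print(f"95% CI = ({ci[0]:+.4f}, {ci[1]:+.4f})")
print(f"z = {z:.3f}, p = {p_value:.2e}; chi2 = {chi2:.3f}, p = {p_chi2:.2e}")
print(f"Cohen's h = {cohens_h:.3f}")
```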
Look at the full picture, not just the p-value. You get a point estimate of the lift, a confidence interval that shows how precise it is, a p-value for "is it distinguishable from no effect," and an effect size. That's the reporting trio from Effect Sizes, applied.
A chart makes the comparison and its uncertainty legible. We'll show each arm's conversion rate with an error bar (the CI half-width).
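A minimal plotting sketch with matplotlib follows. The counts are hypothetical, and the error bars are per-arm 95% intervals from the normal approximation. Overlapping per-arm bars are not a substitute for the interval on the difference, so treat the chart as a companion to the analysis, not a test.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Hypothetical final counts for the two arms.
labels = ["Control (A)", "Treatment (B)"]
conv = np.array([1_000, 1_200])
n = np.array([10_000, 10_000])

rates = conv / n
se = np.sqrt(rates * (1 - rates) / n)        # per-arm standard error
half_width = stats.norm.ppf(0.975) * se      # 95% CI half-width per arm

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(labels, rates, yerr=half_width, capsize=8,
       color=["#888888", "#2a7ae2"])
ax.set_ylabel("Conversion rate")
ax.set_title("Conversion rate by arm (95% CI)")
fig.tight_layout()
# Call plt.show() in a notebook, or fig.savefig(<path>) to write a file.
```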
Continuous metric? Use a t-test instead
If your primary metric is continuous — revenue per user, session
length, time-to-checkout — swap the two-proportion test for an
independent t-test (scipy.stats.ttest_ind), report the difference
in means with its CI, and use Cohen's d as the effect size. The
design, randomization, power, and decision logic are identical; only the
test and effect-size formula change.
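A sketch of the continuous-metric version follows. The revenue data are simulated from a gamma distribution purely for illustration. `equal_var=False` selects Welch's t-test, a choice made here because the arms need not share a variance; the text above only specifies an independent t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical revenue per user, right-skewed as revenue usually is.
rev_a = rng.gamma(shape=2.0, scale=10.0, size=5_000)   # control
rev_b = rng.gamma(shape=2.0, scale=10.5, size=5_000)   # treatment

# Welch's t-test (does not assume equal variances).
t_stat, p_value = stats.ttest_ind(rev_b, rev_a, equal_var=False)

# Difference in means with a 95% CI (Welch-Satterthwaite degrees of freedom).
n_a, n_b = len(rev_a), len(rev_b)
var_a, var_b = rev_a.var(ddof=1), rev_b.var(ddof=1)
diff = rev_b.mean() - rev_a.mean()
se = np.sqrt(var_a / n_a + var_b / n_b)
dof = se ** 4 / ((var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
t_crit = stats.t.ppf(0.975, dof)
ci = (diff - t_crit * se, diff + t_crit * se)

# Cohen's d with the pooled standard deviation.
pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
cohens_d = diff / pooled_sd

print(f"diff in means = {diff:+.3f}, 95% CI = ({ci[0]:+.3f}, {ci[1]:+.3f})")
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, Cohen's d = {cohens_d:.3f}")
```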
Decide: ship or don't ship
A significant p-value is permission to look at the effect size, not an order to ship. The decision weighs two axes together — exactly the significance-vs-importance distinction from Effect Sizes.
Walk the branches:
- Not significant. No green light. But check power first — a null result on a tiny sample means "we couldn't tell," not "no effect."
- Significant but tiny. The classic trap. With enough traffic a 0.05pp lift is "significant" and worthless. Look at the effect size and whether the whole confidence interval clears the threshold you'd care about.
- Significant and meaningful. Ship — and ideally confirm the CI's lower bound is still above the level that justifies the rollout cost.
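The branches above can be written down as a small rule over the confidence interval on the lift. This is a sketch, not a policy: the function name, the labels, and the `min_worthwhile_lift` threshold are hypothetical, and two branches the list does not spell out (a significant harm, and an interval that straddles the threshold) are added as assumptions.

```python
def ship_decision(ci_low: float, ci_high: float,
                  min_worthwhile_lift: float) -> str:
    """Classify an A/B result from the CI on the lift (treatment minus control).

    All three arguments must be in the same units (e.g. percentage points).
    """
    if ci_low <= 0 <= ci_high:
        # CI includes zero: not significant. Check power before reading
        # this as "no effect".
        return "inconclusive"
    if ci_high < 0:
        # Assumed extra branch: the treatment is significantly worse.
        return "do not ship (harmful)"
    if ci_low >= min_worthwhile_lift:
        # Whole interval clears the "worth it" line.
        return "ship"
    if ci_high < min_worthwhile_lift:
        # Significant, but even the optimistic end is too small.
        return "do not ship (too small)"
    # Assumed extra branch: significant, but the CI straddles the threshold.
    return "judgment call"


# Hypothetical examples, all in percentage points with a 0.5pp threshold.
print(ship_decision(-0.2, 0.9, 0.5))
print(ship_decision(0.005, 0.035, 0.5))
print(ship_decision(0.8, 1.6, 0.5))
```

Choosing the threshold is a business decision made before the test; the code only makes explicit that the p-value alone never appears in the rule.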
The most expensive A/B testing mistake
"It's statistically significant, so we shipped it." Significance with a huge sample can certify a lift far too small to pay for the engineering, the risk, or the added complexity. Always read the effect size and the confidence interval before shipping. A narrow CI sitting entirely above your "worth it" line is the real green light.
Pitfalls that quietly invalidate a test
Even a well-designed test can be ruined by a handful of classic mistakes. Each one inflates false positives or biases the estimate.
- Peeking and early stopping. Watching the p-value daily and stopping the instant it dips below 0.05 dramatically inflates your false-positive rate. The p-value wanders; if you stop on the first lucky dip, you stop on noise. Decide n (or use a proper sequential-testing method) and run to it.
- Multiple metrics / multiple variants. Every extra metric or arm is another shot at a false positive. Correct for it (e.g., Bonferroni) or pre-commit to one primary metric.
- Novelty effect. Users click the new thing because it's new; the bump fades. Run long enough to see whether the lift persists past the novelty window.
- Simpson's paradox across segments. An overall win can hide a loss in every segment (or vice versa) if your traffic mix shifted — straight from Statistical Fallacies. Sanity-check key segments.
- Unequal / broken groups (assignment bias). If the arms differ in size or composition more than randomization predicts — a "sample ratio mismatch" — your randomization is broken and the comparison is suspect.
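A sample ratio mismatch can be checked with a chi-square goodness-of-fit test on the arm sizes. The sketch below assumes a planned 50/50 split; the function name `srm_p_value` and the counts are hypothetical. Teams often use a strict cutoff for this check (for example 0.001), though the exact threshold is a convention, not a rule.

```python
from scipy import stats


def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """P-value for 'the observed arm sizes match the planned split'."""
    total = n_a + n_b
    expected = [total * expected_share_a, total * (1 - expected_share_a)]
    return float(stats.chisquare([n_a, n_b], f_exp=expected).pvalue)


# Hypothetical arm sizes under a planned 50/50 split.
print(f"10,000 vs 10,100: p = {srm_p_value(10_000, 10_100):.3f}")
print(f"50,000 vs 51,500: p = {srm_p_value(50_000, 51_500):.2e}")
```

A very small p-value here says the assignment mechanism, not the treatment, needs investigating before any lift is interpreted.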
Peeking is sneakier than it sounds
It feels harmless to "just check" the dashboard and stop when it hits significance. It isn't. Because the p-value fluctuates over time, a test of two identical arms will cross p < 0.05 at some point surprisingly often if you keep peeking. Pre-commit to your stopping rule, or use a method designed for continuous monitoring.
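To see how often "surprisingly often" is, the sketch below simulates A/A tests (both arms share the same true rate, so every significant result is a false positive) and compares two stopping rules. The traffic numbers, the 20 daily looks, and the seed are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_days, users_per_day, true_rate = 2_000, 20, 500, 0.10

# A/A tests: identical arms, so any "significant" result is a false positive.
daily_a = rng.binomial(users_per_day, true_rate, size=(n_sims, n_days))
daily_b = rng.binomial(users_per_day, true_rate, size=(n_sims, n_days))

# Cumulative conversions and users per arm at the end of each day.
cum_a = daily_a.cumsum(axis=1)
cum_b = daily_b.cumsum(axis=1)
n_cum = users_per_day * np.arange(1, n_days + 1)

# Two-proportion z-test (pooled SE) at every daily look.
p_pool = (cum_a + cum_b) / (2 * n_cum)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_cum)
z = (cum_b - cum_a) / n_cum / se
p_values = 2 * stats.norm.sf(np.abs(z))

# Rule 1: look once, at the pre-committed end of the test.
fpr_fixed = float((p_values[:, -1] < 0.05).mean())
# Rule 2: peek daily and stop the first time p < 0.05.
fpr_peeking = float((p_values < 0.05).any(axis=1).mean())

print(f"false-positive rate, single look at the end: {fpr_fixed:.3f}")
print(f"false-positive rate, stop at first p < 0.05: {fpr_peeking:.3f}")
```

The single-look rate should land near the nominal 5%, while the peeking rate comes out several times higher, even though nothing about the arms differs.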
Practice
You're handed the final counts from an A/B test:
- Control: conv_a conversions out of n_a visitors.
- Treatment: conv_b conversions out of n_b visitors.
Compute two values:
- rel_lift: the relative lift in conversion rate, (p_b - p_a) / p_a, as a float (e.g. 0.20 for a 20% relative lift).
- p_value: the two-sided p-value from a two-proportion z-test (pooled standard error under H0), as a float.
Steps for the z-test:
1. p_pool = (conv_a + conv_b) / (n_a + n_b)
2. se = sqrt(p_pool*(1-p_pool)*(1/n_a + 1/n_b))
3. z = (p_b - p_a) / se, then p_value = 2*(1 - norm.cdf(abs(z))).
Using the same kind of A/B counts, build a 95% confidence interval for the difference in conversion rates (treatment minus control), using the unpooled standard error.
- p_a = conv_a/n_a, p_b = conv_b/n_b; the difference is p_b - p_a.
- se = sqrt( p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b )
- Critical value: stats.norm.ppf(0.975).
- Store the endpoints in a tuple ci = (low, high) of two floats with low < high.
Then set a boolean excludes_zero to True if the whole interval is above 0 (a significant positive difference), else False.
Check your understanding
Why does randomly assigning users to A and B let an A/B test support a causal claim, when ordinary observational data usually can't?
Because randomization guarantees the two groups have identical sample sizes
Because randomization balances confounders — known and unknown — across the two groups, so the treatment is the only systematic difference
Because randomization increases the sample size
Because randomized tests never produce false positives
A test reaches p = 0.04 after 3 days. Your teammate wants to stop and ship immediately. What's the statistical problem with stopping the moment it crosses 0.05?
Nothing — once p < 0.05 the result is locked in and valid
Peeking and stopping at the first significant moment inflates the false-positive rate well above the stated alpha
The p-value is too high to stop
You should switch to a one-sided test to justify stopping
With 50 million users (about 25 million per arm), a new layout shows a statistically significant lift in conversion of 0.02 percentage points (from 8.00% to 8.02%). The 95% CI is [0.005pp, 0.035pp]. Should you ship?
Yes — it's statistically significant, and significance means ship
Probably not on this alone — the effect is real but so small it likely isn't worth the cost; weigh the tiny effect size against the rollout cost
Yes, because the confidence interval excludes zero
No, because the result must be a false positive
Your team's experiment tracks 12 metrics and declares victory because one of them improved with p = 0.03. What's the flaw?
A t-test was the wrong choice for 12 metrics
Testing 12 metrics multiplies the chance that at least one looks significant by luck; without correction or a pre-declared primary metric, the "win" may be noise
p = 0.03 is not significant enough
They should have used a larger sample
An A/B test shows the treatment winning overall, but when you split by device, the treatment loses on both mobile and desktop. What's the most likely culprit?
The treatment is genuinely better, so ignore the segments
The test had too few users
Simpson's paradox — the traffic mix differs between arms, so the pooled result reverses what happens within each device
The metric was continuous instead of binary
Key takeaways
- Design before you collect: one falsifiable hypothesis, one primary metric, randomization, and a power-based sample size — all up front.
- Randomization balances confounders across arms, which is what licenses a causal conclusion.
- Analyze fully: report the lift, a confidence interval on the difference, an effect size, and the p-value — never the p-value alone.
- Decide on two axes: ship only when the result is significant and the effect is practically large enough (CI comfortably past your threshold).
- Beware the pitfalls: peeking/early stopping, multiple metrics, novelty effects, Simpson's paradox across segments, and broken (unequal) groups.