Dataslope

A/B Testing

The capstone — design a real experiment with a clear hypothesis, one primary metric, randomization, and power up front; then analyze it fully with a test, a lift, a confidence interval, and an effect size; and decide to ship or not by weighing significance against practical importance.

This is where the whole course comes together. An A/B test (a randomized controlled experiment) is the single most powerful tool a data scientist has for answering "does this change actually work?" — and it runs on every concept you've built: populations and samples, randomization, sampling distributions, confidence intervals, hypothesis tests, p-values, power, and effect sizes. If you can design and read an A/B test correctly, you can be trusted with real decisions.

We'll go in the order a real experiment does: design it before you collect a single row, analyze it once it's done, and decide whether to ship — then catalog the pitfalls that quietly invalidate otherwise-fine tests.

Design comes first (and is where most tests are won or lost)

The analysis is the easy part. A test that wasn't designed properly can't be rescued by clever statistics afterward. Four decisions, all made before data collection:

1. A single, falsifiable hypothesis. "The new checkout button increases the purchase-conversion rate." Not "the new button is better" (better at what?) — name the exact change and the exact outcome.

2. One primary metric, chosen in advance. Pick the number the decision hinges on — say, conversion rate — and commit to it. Tracking ten metrics and celebrating whichever one moves is the multiple-comparisons trap from Statistical Fallacies wearing a business suit.

3. Randomization. Assign each user to control (A) or treatment (B) at random. This is the load-bearing wall of the whole method, so it gets its own section below.

4. Sample size / power, up front. Decide how many users you need before starting, so the test can actually detect an effect worth caring about. More on this two sections down.

Primary metric is singular for a reason

The moment you have five "primary" metrics, you're running five tests, and the chance that at least one looks significant by luck balloons. Declare one primary metric for the ship decision; everything else is a secondary/guardrail metric you look at for context, not for the verdict.
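
To put a rough number on "balloons": if each of k metrics is tested at α = 0.05 and the tests are treated as independent (a simplifying assumption, since real metrics are usually correlated), the chance of at least one false positive is 1 - (1 - α)^k. A minimal sketch, with a hypothetical helper name:

```python
# Chance of at least one false positive across k tests, assuming the
# tests are independent (a simplification; real metrics are often correlated).
def familywise_error_rate(k: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** k


for k in (1, 5, 12):
    rate = familywise_error_rate(k)
    print(f"{k:>2} metrics -> P(at least one false positive) = {rate:.3f}")
```

Under that independence assumption, five "primary" metrics already push the false-positive chance to roughly 23%, and twelve push it to roughly 46%.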

Why randomization licenses causal claims

Here is the deep reason A/B tests can say "X caused Y" when ordinary observational data (from Statistical Fallacies) cannot. Randomly assigning who gets the treatment makes the two groups comparable, in expectation, on everything except the change you're testing, whether measured or unmeasured. Age, device, loyalty, the weather, mood, and every confounder you never thought of are, on average, balanced across the two arms; any leftover imbalance is chance variation, which the test and the confidence interval already account for. The treatment is therefore the only systematic difference between the groups, so if B outperforms A by more than chance allows, the treatment is the explanation. That severs the confounder's arrow and earns you a causal conclusion.

Randomization vs. observation

Without randomization, "users who saw the new button converted more" could just mean the new button shipped to your most engaged users. With randomization, the engaged users are split roughly evenly between A and B, so that explanation is off the table. Randomization is what turns a correlation into a credible cause.
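
A small simulation can make the contrast concrete. This is a sketch under made-up assumptions: a population in which 30% of users are "engaged", a coin-flip assignment, and a hypothetical targeted rollout in which engaged users are more likely to see the new button. All names and numbers are illustrative.

```python
import random

random.seed(0)

N_USERS = 20_000
# Hypothetical population: 30% of users are "engaged" (the confounder).
engaged = [random.random() < 0.30 for _ in range(N_USERS)]


def engaged_share_by_arm(in_treatment):
    """Return (share engaged in A, share engaged in B) for per-user B flags."""
    arm_a = [e for e, in_b in zip(engaged, in_treatment) if not in_b]
    arm_b = [e for e, in_b in zip(engaged, in_treatment) if in_b]
    return sum(arm_a) / len(arm_a), sum(arm_b) / len(arm_b)


# Randomized: a fair coin flip that ignores engagement entirely.
randomized = [random.random() < 0.5 for _ in engaged]

# Non-random rollout: engaged users are more likely to get the new button.
targeted = [random.random() < (0.8 if e else 0.4) for e in engaged]

rand_a, rand_b = engaged_share_by_arm(randomized)
targ_a, targ_b = engaged_share_by_arm(targeted)
print(f"randomized: engaged share A = {rand_a:.3f}, B = {rand_b:.3f}")
print(f"targeted:   engaged share A = {targ_a:.3f}, B = {targ_b:.3f}")
```

With the coin flip, the engaged share is nearly the same in both arms; with the targeted rollout, arm B is stacked with engaged users, so any difference in conversion would be confounded with engagement.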

Power and sample size, decided before you start

You met statistical power in Errors and Power: the probability your test detects a real effect of a given size. Run a test on too few users and even a genuine, worthwhile lift can come back non-significant — you "found nothing" only because you lacked the data to find it. So you size the experiment up front, around the smallest effect worth detecting (the minimum detectable effect, MDE).

The four knobs trade off against each other:

Knob | Effect on required sample size
Smaller effect to detect (MDE) | needs more users
Higher power (e.g., 0.9 vs 0.8) | needs more users
Smaller α (stricter) | needs more users
Noisier metric (bigger variance) | needs more users
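
The knobs can be explored with a short sample-size calculation. The sketch below uses a common normal-approximation formula for a two-sided test of two proportions with unpooled variances; the function name, the 10% baseline, and the MDE values are assumptions for illustration. For a binary metric the "noise" knob is fixed by the rate itself, through p(1 - p).

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p_base, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion test.

    p_base:  baseline conversion rate in control
    mde_abs: minimum detectable effect, as an absolute difference in rates
    """
    p_treat = p_base + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde_abs ** 2)


# Hypothetical baseline of 10% conversion.
print("detect +2pp:          ", sample_size_per_arm(0.10, 0.02))
print("smaller MDE (+1pp):   ", sample_size_per_arm(0.10, 0.01))
print("higher power (0.90):  ", sample_size_per_arm(0.10, 0.02, power=0.90))
print("stricter alpha (0.01):", sample_size_per_arm(0.10, 0.02, alpha=0.01))
```

Every variation from the first line requires more users than the baseline scenario, and halving the MDE roughly quadruples the requirement because the effect enters the formula squared.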

Why size up front, not after

If you decide the sample size after peeking at results, you've turned a clean test into a fishing expedition — you'll be tempted to stop the moment the numbers look good. Committing to n in advance is what keeps the false-positive rate at the α you intended.

Analyze: simulate and fully read one experiment

Let's run a complete analysis. We'll simulate an experiment with a known true effect (so we can check our work), then analyze it the way you would real data: compute each arm's conversion rate, the observed lift, a confidence interval on the difference, a formal test, the p-value, and an effect size.

For a conversion (yes/no) metric, the natural test compares two proportions. We'll use a two-proportion z-test computed directly, and cross-check it against a chi-square test on the 2×2 table — they answer the same question.
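
One way to sketch that analysis with only the standard library is shown below. The seed, the sample sizes, and the true rates (10% vs. 12%) are hypothetical choices, and Cohen's h is used here as one reasonable effect size for two proportions, not necessarily the one this course's original code reported. The chi-square statistic is computed by hand without a continuity correction; `scipy.stats.chi2_contingency(table, correction=False)` is the library route to the same test.

```python
import math
import random
from statistics import NormalDist

random.seed(42)

# Hypothetical ground truth, known only because this is a simulation.
N_A = N_B = 10_000
TRUE_P_A, TRUE_P_B = 0.10, 0.12
conv_a = sum(random.random() < TRUE_P_A for _ in range(N_A))
conv_b = sum(random.random() < TRUE_P_B for _ in range(N_B))

# Point estimates: per-arm rates, absolute lift, relative lift.
p_a, p_b = conv_a / N_A, conv_b / N_B
abs_lift = p_b - p_a
rel_lift = abs_lift / p_a

# 95% CI on the difference (unpooled standard error).
se_unpooled = math.sqrt(p_a * (1 - p_a) / N_A + p_b * (1 - p_b) / N_B)
z_crit = NormalDist().inv_cdf(0.975)
ci = (abs_lift - z_crit * se_unpooled, abs_lift + z_crit * se_unpooled)

# Two-proportion z-test (pooled standard error under H0: equal rates).
p_pool = (conv_a + conv_b) / (N_A + N_B)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / N_A + 1 / N_B))
z = abs_lift / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Cross-check: Pearson chi-square on the 2x2 table, no continuity correction.
observed = [[conv_a, N_A - conv_a], [conv_b, N_B - conv_b]]
row_totals = [N_A, N_B]
col_totals = [conv_a + conv_b, (N_A - conv_a) + (N_B - conv_b)]
total = N_A + N_B
chi2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / total) ** 2
    / (row_totals[i] * col_totals[j] / total)
    for i in range(2)
    for j in range(2)
)
# With 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
p_value_chi2 = math.erfc(math.sqrt(chi2 / 2))

# Effect size for two proportions: Cohen's h.
cohens_h = 2 * math.asin(math.sqrt(p_b)) - 2 * math.asin(math.sqrt(p_a))

print(f"rate A = {p_a:.4f}, rate B = {p_b:.4f}")
print(f"absolute lift = {abs_lift * 100:.2f}pp, relative lift = {rel_lift:.1%}")
print(f"95% CI on difference = [{ci[0] * 100:.2f}pp, {ci[1] * 100:.2f}pp]")
print(f"z = {z:.3f}, p = {p_value:.2g} | chi2 = {chi2:.3f}, p = {p_value_chi2:.2g}")
print(f"Cohen's h = {cohens_h:.3f}")
```

The chi-square statistic equals z squared and the two p-values agree, which is the sense in which the two tests answer the same question.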

Look at the full picture, not just the p-value. You get a point estimate of the lift, a confidence interval that shows how precise it is, a p-value for "is it distinguishable from no effect," and an effect size. That's the reporting habit from Effect Sizes, applied: never the p-value alone.

A chart makes the comparison and its uncertainty legible. We'll show each arm's conversion rate with an error bar (the CI half-width).
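
A minimal plotting sketch, assuming matplotlib is installed. The counts are hypothetical and the output file name is arbitrary; each error bar is the 95% CI half-width for that arm's own rate, based on the normal approximation.

```python
import math
from statistics import NormalDist

import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

# Hypothetical final counts, for illustration only: (conversions, visitors).
arms = {"A (control)": (1_000, 10_000), "B (treatment)": (1_200, 10_000)}

z_crit = NormalDist().inv_cdf(0.975)
labels, rates, half_widths = [], [], []
for label, (conversions, visitors) in arms.items():
    rate = conversions / visitors
    se = math.sqrt(rate * (1 - rate) / visitors)
    labels.append(label)
    rates.append(rate)
    half_widths.append(z_crit * se)  # 95% CI half-width for this arm

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(labels, rates, yerr=half_widths, capsize=8, color=["#999999", "#1f77b4"])
ax.set_ylabel("Conversion rate")
ax.set_title("Conversion rate by arm (95% CI)")
fig.tight_layout()
fig.savefig("ab_test_rates.png", dpi=150)
```

Per-arm error bars are a visual aid only. Two bars can overlap slightly while the difference is still significant, so the confidence interval on the difference remains the number to base the decision on.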

Continuous metric? Use a t-test instead

If your primary metric is continuous — revenue per user, session length, time-to-checkout — swap the two-proportion test for an independent t-test (scipy.stats.ttest_ind), report the difference in means with its CI, and use Cohen's d as the effect size. The design, randomization, power, and decision logic are identical; only the test and effect-size formula change.
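
A sketch of the continuous-metric version, assuming NumPy and SciPy are available. The revenue data are simulated from made-up gamma distributions, and `equal_var=False` selects Welch's variant of the t-test, which is a choice made here rather than something the text prescribes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical revenue-per-user data, simulated for illustration only.
revenue_a = rng.gamma(shape=2.0, scale=10.0, size=5_000)  # mean around 20
revenue_b = rng.gamma(shape=2.0, scale=10.5, size=5_000)  # mean around 21

diff = revenue_b.mean() - revenue_a.mean()

# Welch's t-test: does not assume equal variances in the two arms.
t_stat, p_value = stats.ttest_ind(revenue_b, revenue_a, equal_var=False)

# 95% CI on the difference in means (normal approximation, fine at this n).
n_a, n_b = revenue_a.size, revenue_b.size
var_a, var_b = revenue_a.var(ddof=1), revenue_b.var(ddof=1)
se = np.sqrt(var_a / n_a + var_b / n_b)
z_crit = stats.norm.ppf(0.975)
ci = (diff - z_crit * se, diff + z_crit * se)

# Cohen's d: difference in means over the pooled standard deviation.
pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
cohens_d = diff / pooled_sd

print(f"difference in means = {diff:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
print(f"t = {t_stat:.3f}, p = {p_value:.3g}, Cohen's d = {cohens_d:.3f}")
```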

Decide: ship or don't ship

A significant p-value is permission to look at the effect size, not an order to ship. The decision weighs two axes together — exactly the significance-vs-importance distinction from Effect Sizes.

Walk the branches:

  • Not significant. No green light. But check power first — a null result on a tiny sample means "we couldn't tell," not "no effect."
  • Significant but tiny. The classic trap. With enough traffic, even a 0.05pp lift can be "significant" and still worthless. Look at the effect size and check whether the whole confidence interval clears the threshold you'd care about.
  • Significant and meaningful. Ship — and ideally confirm the CI's lower bound is still above the level that justifies the rollout cost.
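
The branches above can be sketched as a small decision helper. The function name and the `min_worthwhile_lift` threshold are hypothetical; the threshold is a business judgment, not a statistical output. Two extra branches (a significant loss, and a CI that straddles the threshold) are added here for completeness and are not part of the three cases listed above.

```python
def ship_decision(ci_low: float, ci_high: float, min_worthwhile_lift: float) -> str:
    """Map a 95% CI on the lift and a business threshold to a decision.

    All three arguments share one unit (for example, percentage points).
    """
    if ci_low <= 0 <= ci_high:
        return "not significant: no green light; check power before concluding 'no effect'"
    if ci_high < 0:
        return "significant harm: do not ship"
    if ci_low >= min_worthwhile_lift:
        return "significant and meaningful: ship"
    if ci_high < min_worthwhile_lift:
        return "significant but tiny: likely not worth the rollout cost"
    return "significant, CI straddles the threshold: judgment call or collect more data"


# Hypothetical intervals in percentage points, with a 0.5pp "worth it" line.
print(ship_decision(-0.30, 1.10, 0.5))
print(ship_decision(0.005, 0.035, 0.5))
print(ship_decision(0.80, 2.40, 0.5))
```

Framing the decision around the interval rather than the p-value keeps both axes, significance and practical importance, in view at once.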

The most expensive A/B testing mistake

"It's statistically significant, so we shipped it." Significance with a huge sample can certify a lift far too small to pay for the engineering, the risk, or the added complexity. Always read the effect size and the confidence interval before shipping. A narrow CI sitting entirely above your "worth it" line is the real green light.

Pitfalls that quietly invalidate a test

Even a well-designed test can be ruined by a handful of classic mistakes. Each one inflates false positives or biases the estimate.

  • Peeking and early stopping. Watching the p-value daily and stopping the instant it dips below 0.05 dramatically inflates your false-positive rate. The p-value wanders; if you stop on the first lucky dip, you stop on noise. Decide n (or use a proper sequential-testing method) and run to it.
  • Multiple metrics / multiple variants. Every extra metric or arm is another shot at a false positive. Correct for it (e.g., Bonferroni) or pre-commit to one primary metric.
  • Novelty effect. Users click the new thing because it's new; the bump fades. Run long enough to see whether the lift persists past the novelty window.
  • Simpson's paradox across segments. An overall win can hide a loss in every segment (or vice versa) if the traffic mix differs between the arms, exactly the reversal described in Statistical Fallacies. Sanity-check key segments.
  • Unequal / broken groups (assignment bias). If the arm sizes deviate from the planned split by more than chance predicts (a "sample ratio mismatch"), or the arms differ in composition, the assignment mechanism is likely broken and the comparison is suspect.
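
The last pitfall is the easiest to check mechanically. A minimal sketch of a sample-ratio-mismatch check, using a chi-square goodness-of-fit test on the arm sizes; the function name and the example counts are hypothetical, and the planned split is assumed to be 50/50 unless stated otherwise.

```python
import math


def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for the observed arm sizes (1 df)."""
    total = n_a + n_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = (n_a - expected_a) ** 2 / expected_a + (n_b - expected_b) ** 2 / expected_b
    # With 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


print(f"{srm_p_value(50_210, 49_790):.3f}")  # plausible under a 50/50 split
print(f"{srm_p_value(51_500, 48_500):.3g}")  # far too lopsided for a 50/50 split
```

A very small p-value here says nothing about the treatment. It says the assignment or logging is likely broken, and the experiment's results should not be trusted until the cause is found. Teams often use a strict cutoff for this check (0.001 is a common convention) because it runs on every experiment.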

Peeking is sneakier than it sounds

It feels harmless to "just check" the dashboard and stop when it hits significance. It isn't. Because the p-value fluctuates over time, a test of two identical arms will cross p < 0.05 at some point surprisingly often if you keep peeking. Pre-commit to your stopping rule, or use a method designed for continuous monitoring.
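
A simulation shows how often "surprisingly often" is. The sketch below runs A/A tests (both arms share the same true rate, so every "win" is a false positive) and compares two rules: stop at the first look with p < 0.05, versus test once at the planned end. The number of simulations, looks, users per look, and the 10% rate are all hypothetical settings.

```python
import math
import random
from statistics import NormalDist

random.seed(1)


def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


N_SIMS, N_LOOKS, USERS_PER_LOOK, RATE = 1_000, 10, 200, 0.10

flagged_at_any_look = 0  # the peeker's rule
flagged_at_final = 0     # the pre-committed rule
for _sim in range(N_SIMS):
    conv_a = conv_b = n = 0
    crossed = False
    for _look in range(N_LOOKS):
        # Same true rate in both arms: any "significant" result is noise.
        conv_a += sum(random.random() < RATE for _ in range(USERS_PER_LOOK))
        conv_b += sum(random.random() < RATE for _ in range(USERS_PER_LOOK))
        n += USERS_PER_LOOK
        p = z_test_p_value(conv_a, n, conv_b, n)
        crossed = crossed or p < 0.05
    flagged_at_any_look += crossed
    flagged_at_final += (p < 0.05)

peeking_rate = flagged_at_any_look / N_SIMS
planned_rate = flagged_at_final / N_SIMS
print(f"false-positive rate, stop at first p < 0.05: {peeking_rate:.3f}")
print(f"false-positive rate, single planned look:    {planned_rate:.3f}")
```

The single planned look stays near the nominal 5%, while the stop-at-first-dip rule lands well above it, and the gap widens as the number of looks grows.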

Practice

Challenge
Lift and a two-proportion test

You're handed the final counts from an A/B test:

  • Control: conv_a conversions out of n_a visitors.
  • Treatment: conv_b conversions out of n_b visitors.

Compute two values:

  • rel_lift: the relative lift in conversion rate, (p_b - p_a) / p_a, as a float (e.g. 0.20 for a 20% relative lift).
  • p_value: the two-sided p-value from a two-proportion z-test (pooled standard error under H0), as a float.

Steps for the z-test:

  • p_pool = (conv_a + conv_b) / (n_a + n_b)
  • se = sqrt(p_pool*(1-p_pool)*(1/n_a + 1/n_b))
  • z = (p_b - p_a) / se, then p_value = 2*(1 - norm.cdf(abs(z))).

Challenge
Confidence interval for the difference in rates

Using the same kind of A/B counts, build a 95% confidence interval for the difference in conversion rates (treatment minus control), using the unpooled standard error.

  • p_a = conv_a/n_a, p_b = conv_b/n_b; the difference is p_b - p_a.
  • se = sqrt( p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b )
  • Critical value: stats.norm.ppf(0.975).
  • Store the endpoints in a tuple ci = (low, high) of two floats with low < high.

Then set a boolean excludes_zero to True if the whole interval is above 0 (a significant positive difference), else False.

Check your understanding

Question (select one)

Why does randomly assigning users to A and B let an A/B test support a causal claim, when ordinary observational data usually can't?

Because randomization guarantees the two groups have identical sample sizes

Because randomization balances confounders — known and unknown — across the two groups, so the treatment is the only systematic difference

Because randomization increases the sample size

Because randomized tests never produce false positives

Question (select one)

A test reaches p = 0.04 after 3 days. Your teammate wants to stop and ship immediately. What's the statistical problem with stopping the moment the p-value first drops below 0.05?

Nothing — once p < 0.05 the result is locked in and valid

Peeking and stopping at the first significant moment inflates the false-positive rate well above the stated alpha

The p-value is too high to stop

You should switch to a one-sided test to justify stopping

Question (select one)

With 50 million users, a new layout shows a statistically significant lift in conversion of 0.02 percentage points (from 8.00% to 8.02%). The 95% CI is [0.005pp, 0.035pp]. Should you ship?

Yes — it's statistically significant, and significance means ship

Probably not on this alone — the effect is real but so small it likely isn't worth the cost; weigh the tiny effect size against the rollout cost

Yes, because the confidence interval excludes zero

No, because the result must be a false positive

Question (select one)

Your team's experiment tracks 12 metrics and declares victory because one of them improved with p = 0.03. What's the flaw?

A t-test was the wrong choice for 12 metrics

Testing 12 metrics multiplies the chance that at least one looks significant by luck; without correction or a pre-declared primary metric, the "win" may be noise

p = 0.03 is not significant enough

They should have used a larger sample

Question (select one)

An A/B test shows the treatment winning overall, but when you split by device, the treatment loses on both mobile and desktop. What's the most likely culprit?

The treatment is genuinely better, so ignore the segments

The test had too few users

Simpson's paradox — the traffic mix differs between arms, so the pooled result reverses what happens within each device

The metric was continuous instead of binary

Key takeaways

  • Design before you collect: one falsifiable hypothesis, one primary metric, randomization, and a power-based sample size — all up front.
  • Randomization balances confounders across arms, which is what licenses a causal conclusion.
  • Analyze fully: report the lift, a confidence interval on the difference, an effect size, and the p-value — never the p-value alone.
  • Decide on two axes: ship only when the result is significant and the effect is practically large enough (CI comfortably past your threshold).
  • Beware the pitfalls: peeking/early stopping, multiple metrics, novelty effects, Simpson's paradox across segments, and broken (unequal) groups.
