Statistical Fallacies
The recurring reasoning traps that sink real analyses — Simpson's paradox, confounding, p-hacking, base-rate neglect, survivorship bias, regression to the mean, cherry-picking, and the Texas sharpshooter — what each one is, a concrete example, and how to avoid it.
Most bad data science isn't a coding bug. The code runs, the numbers are correct, the chart is beautiful — and the conclusion is wrong anyway, because the reasoning had a hole in it. These holes are old, named, and recurring. Learning their names is one of the highest-leverage things you can do, because once you can name a trap you start spotting it in the wild, in your colleagues' analyses, and — the hard part — in your own.
This page is a field guide to the classics. For each one: what it is, a concrete example, and how to avoid it. None of them require fancy math to understand. All of them have ended real projects, shipped real bad features, and made it into real headlines.
Simpson's paradox: a trend that reverses when you split the data
What it is. A relationship that holds in the aggregated data can reverse when you break the data into subgroups — or vice versa. The overall trend and the within-group trend point in opposite directions. It happens when a lurking variable is distributed unevenly across the groups you're comparing.
Concrete example. A classic real case: the 1973 UC Berkeley graduate admissions. Overall, men were admitted at a higher rate than women, which looked like bias against women. But department by department, the gap largely vanished: in most departments women were admitted at rates similar to or higher than men's. The catch: women applied disproportionately to the most competitive departments (low admit rates for everyone), while men applied more often to less selective ones. The department was the lurking variable.
Let's build a dataset where exactly this reversal happens, and watch it flip when we disaggregate.
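A minimal sketch of such a dataset in pandas. The department labels and counts are invented for illustration (they are not the Berkeley figures), chosen so that women lead inside each department (65% vs 60%, and 25% vs 20%) while men lead once the departments are pooled (52% vs 33%).

```python
import pandas as pd

# Hypothetical counts, chosen for illustration (not the real Berkeley data).
df = pd.DataFrame({
    "dept":     ["Easy", "Easy", "Hard", "Hard"],
    "gender":   ["men", "women", "men", "women"],
    "applied":  [800, 200, 200, 800],
    "admitted": [480, 130, 40, 200],
})

# Table 1: admission rate within each department.
by_dept = df.assign(rate=df["admitted"] / df["applied"]).pivot(
    index="dept", columns="gender", values="rate"
)
print(by_dept.round(3))

# Table 2: admission rate pooled across departments.
totals = df.groupby("gender")[["applied", "admitted"]].sum()
pooled = totals["admitted"] / totals["applied"]
print(pooled.round(3))
```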
Read the two tables against each other. Within every department, women are admitted more often — yet pooled together, men come out ahead, purely because women concentrated in the hard-to-enter department. Same numbers, opposite story.
How to avoid it. Before trusting an aggregate comparison, ask: is there a grouping variable distributed unevenly across the things I'm comparing? Break the data down by plausible lurking variables and check whether the story survives. If the aggregate and the subgroups disagree, the subgroups are usually the more honest answer — but which one is "right" depends on the causal question you're actually asking.
Simpson's paradox in dashboards
This is not an exotic textbook curiosity — it bites in routine work. "Conversion went up overall but down in every single segment" is a real thing that happens when traffic mix shifts. Whenever an overall metric moves, glance at the segment breakdown before you celebrate or panic.
A hospital reports a higher overall death rate than its rival, yet has a lower death rate for both "low-risk" and "high-risk" patients separately. What is the most likely explanation?
The hospital's data is wrong, since the numbers contradict each other
The hospital treats a larger share of high-risk patients, dragging up its overall rate despite better performance within each risk group
The hospital is genuinely worse, because the overall rate is what matters
Risk level is irrelevant to death rates
Confounding and correlation vs causation
What it is. Two variables move together (they correlate), and it's tempting to conclude one causes the other. But a confounder — a third variable Z that influences both X and Y — can manufacture the correlation with no causal link between X and Y at all.
Concrete example. Ice cream sales correlate with drowning deaths. Banning ice cream will not save swimmers — hot weather drives both: people buy more ice cream and swim more when it's hot. Weather is the confounder. The X–Y correlation is real; the causal story is fake.
How to avoid it. Treat "X correlates with Y" as the start of a question, never the answer. Ask what else could drive both. The gold standard for breaking confounding is a randomized experiment: randomly assigning who gets X severs the link from any confounder to X, which is exactly why A/B tests license causal claims (we lean on this in A/B Testing). Without randomization, control for confounders you can measure — but you can never be sure you've caught them all.
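To see what "controlling for a confounder" can look like in practice, here is a small simulation sketch. The data is synthetic and every coefficient is an assumption: temperature drives both ice cream sales and drownings, and neither affects the other. The helper `residuals` is a hypothetical name; it strips out the straight-line effect of temperature so the two leftovers can be correlated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical daily data: temperature drives both series; neither affects the other.
temperature = rng.normal(25, 5, n)
ice_cream = 10 * temperature + rng.normal(0, 20, n)
drownings = 0.3 * temperature + rng.normal(0, 1, n)

raw_r = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlate what remains of each series once temperature is accounted for.
adjusted_r = np.corrcoef(
    residuals(ice_cream, temperature), residuals(drownings, temperature)
)[0, 1]

print(f"raw correlation:              {raw_r:.2f}")
print(f"controlling for temperature:  {adjusted_r:.2f}")
```

Under these assumptions the raw correlation is strong and the adjusted one sits near zero. This only works because the confounder was measured; an unmeasured one would leave the raw correlation unexplained.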
The phrase to delete from your vocabulary
"The data shows X causes Y." From observational data alone, it almost never does. Correlation is nowhere near sufficient for causation (and, when effects are nonlinear or mask each other, a causal link can even exist without a visible correlation). We explore this whole tangle in Correlation and Nonparametric Tests; here the point is just to make "could a confounder explain this?" an automatic reflex.
P-hacking, multiple comparisons, and the garden of forking paths
What it is. If you run one test at α = 0.05, there's a 5% chance of a false positive when nothing is real. But if you run twenty such tests, you'd expect about one "significant" result by pure chance. Hunting through many comparisons (or many model specifications, or many subgroups) and reporting only the ones that "worked" is p-hacking. The "garden of forking paths" is the subtler cousin: even without consciously fishing, the many small analysis choices you'd have made differently had the data looked different inflate your false-positive rate.
Concrete example. Test whether each of 20 unrelated variables predicts churn. Even if none truly does, roughly one will come back with p < 0.05, and if you publish that one as "the finding," you've reported noise as signal. Let's watch it happen.
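A sketch of that experiment, assuming 500 customers, a coin-flip churn label, and features drawn as independent noise (all of these are illustrative choices, not a real dataset). Each feature is tested against churn with `scipy.stats.pearsonr`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_customers, n_features, alpha = 500, 20, 0.05

# Churn is a coin flip here: by construction, no feature has any real effect.
churn = rng.integers(0, 2, size=n_customers)

p_values = []
for _ in range(n_features):
    feature = rng.normal(size=n_customers)   # generated independently of churn
    r, p = stats.pearsonr(feature, churn)
    p_values.append(p)

p_values = np.array(p_values)
n_significant = int((p_values < alpha).sum())

print(f"{n_significant} of {n_features} tests have p < {alpha}")
print("smallest p-value:", p_values.min().round(4))
```

The exact count depends on the seed; across many seeds it averages one in twenty, because under the null hypothesis p-values are spread uniformly between 0 and 1.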
About one of twenty fires even though nothing is real. Report just that one and you've laundered chance into a "discovery."
How to avoid it.
- Pre-register your primary hypothesis and analysis before seeing the data, so the analysis isn't chosen to fit the noise.
- Correct for multiple comparisons (e.g., Bonferroni: use α / m for m tests; or control the false discovery rate).
- Treat results found by exploration as hypotheses to confirm on fresh data, not conclusions. That is the central discipline of Exploratory Statistical Analysis.
Bonferroni in one line
Running m tests and want an overall 5% error rate? Compare each p-value to 0.05 / m instead of 0.05. For 20 tests that's 0.0025, a bar that a pure-noise p-value clears only about once in 400 tries, so it would typically wipe out the false positives in a simulation like the one above. It's conservative, but it's a safe default.
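The effect of the correction can be checked with a few lines of arithmetic. This sketch assumes the m tests are independent and that every null hypothesis is true, and computes the chance of at least one false positive with and without Bonferroni.

```python
alpha, m = 0.05, 20

# Chance of at least one false positive across m independent tests
# when every null hypothesis is true.
fwer_uncorrected = 1 - (1 - alpha) ** m
fwer_bonferroni = 1 - (1 - alpha / m) ** m

print(f"per-test threshold after correction: {alpha / m}")
print(f"family-wise error, uncorrected: {fwer_uncorrected:.3f}")
print(f"family-wise error, Bonferroni:  {fwer_bonferroni:.3f}")
```

Uncorrected, the chance of at least one spurious hit across 20 tests is roughly 64%; with the corrected threshold it drops to just under 5%.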
Base-rate neglect
What it is. Ignoring how rare something is when interpreting a test. Even an accurate test for a rare condition produces mostly false positives, because there are so many more people without the condition than with it to begin with. Forgetting the base rate makes a positive result feel far more conclusive than it is.
Concrete example. A disease affects 1 in 1,000 people. A test is 99% accurate (1% false-positive rate, catches all true cases). You test positive. Your chance of actually having the disease is not 99% — it's about 9%. Run the numbers on 100,000 people:
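One way to lay out that arithmetic, as a small sketch (the variable names are illustrative; the numbers are the ones from the example):

```python
population = 100_000
prevalence = 1 / 1_000        # 1 in 1,000 people have the disease
sensitivity = 1.0             # the test catches every true case
false_positive_rate = 0.01    # 1% of healthy people test positive

sick = population * prevalence
healthy = population - sick

true_positives = sick * sensitivity
false_positives = healthy * false_positive_rate

# Share of positive results that belong to people who are actually sick.
p_sick_given_positive = true_positives / (true_positives + false_positives)

print(f"true positives:  {true_positives:,.0f}")
print(f"false positives: {false_positives:,.0f}")
print(f"P(disease | positive) = {p_sick_given_positive:.1%}")
```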
The roughly 1,000 false positives (999, to be exact) drown out the 100 true ones: 100 out of about 1,100 positives is about 9%. We derived this formally with Bayes' rule in Conditional Probability; the fallacy here is forgetting to do it at all.
How to avoid it. Always ask "how common is this in the first place?" before trusting a positive result. For rare events, a single positive is weak evidence — confirm with a second, independent test.
A fraud detector flags 2% of all transactions and is "95% accurate." Most transactions are legitimate. A flagged transaction is most likely to be:
Definitely fraud, since the detector is 95% accurate
More likely legitimate than fraudulent, because fraud is rare and false positives pile up
Impossible to assess without knowing the time of day
Equally likely to be fraud or legitimate
Survivorship bias
What it is. Drawing conclusions from only the cases that "survived" some selection, while the ones that didn't make it are invisible — silently dropped from your data. The survivors are a biased sample, so any pattern you find in them may be an artifact of what got filtered out, not a real effect.
Concrete example. In WWII, analysts examined returning bombers to decide where to add armor, and found bullet holes concentrated on the wings and tail. The instinct was to armor those spots. The statistician Abraham Wald pointed out the opposite: armor the untouched areas (engines, cockpit). The planes hit there never came back — so the holes you see mark the survivable hits. The data was literally the survivors.
How to avoid it. Ask "what's missing from this dataset, and why?" Whenever your data is the result of a selection ("successful startups," "customers who stayed," "students who graduated," "stocks still in the index"), the dropouts carry information you can't see. Mutual-fund "average returns" that quietly exclude funds that shut down are the modern, financial version of the bullet-hole map.
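The mutual-fund version can be sketched with a toy simulation. Everything here is an assumption made for illustration: 2,000 hypothetical funds with zero true skill, yearly returns that are pure noise around 0%, and a closure rule that shuts a fund once its cumulative return falls below -20%.

```python
import numpy as np

rng = np.random.default_rng(7)
n_funds, n_years = 2_000, 10

# Hypothetical funds with zero true skill: yearly returns are noise around 0%.
returns = rng.normal(loc=0.0, scale=0.15, size=(n_funds, n_years))

# Assumed closure rule: a fund shuts down if its cumulative return
# ever drops below -20%.
cumulative = np.cumprod(1 + returns, axis=1) - 1
survived = (cumulative > -0.20).all(axis=1)

mean_all = returns.mean()
mean_survivors = returns[survived].mean()

print(f"funds surviving: {survived.sum()} of {n_funds}")
print(f"average yearly return, all funds:      {mean_all:.2%}")
print(f"average yearly return, survivors only: {mean_survivors:.2%}")
```

No fund has any skill, yet the survivors' average looks clearly positive, simply because the unlucky funds were filtered out before the average was taken.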
Regression to the mean
What it is. Extreme measurements tend to be followed by less extreme ones, simply because part of any extreme value was luck, and luck doesn't repeat. An unusually high reading is high partly because the random component happened to be high that time; next time, the random part is typically closer to average, so the value drifts back toward the mean — with no intervention required.
Concrete example. The "Sports Illustrated jinx": athletes featured on the cover often slump afterward. No curse — they made the cover after a career-best stretch (partly luck), and ordinary performance afterward looks like a decline. The same logic explains why "punishing the worst performers seems to work" and "rewarding the best seems to backfire": both groups were partly lucky/unlucky and drift back regardless.
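A simulation sketch makes the drift visible. It assumes a simple model in which every score is a stable skill plus fresh luck on each measurement, with nothing done between the two measurements; the top and bottom 10% are picked on the first score only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Assumed model: each score = stable skill + fresh luck on every measurement.
skill = rng.normal(0, 1, n)
first = skill + rng.normal(0, 1, n)
second = skill + rng.normal(0, 1, n)   # same skill, new luck, no intervention

# Select the extremes using the first measurement only.
top = first >= np.quantile(first, 0.90)
bottom = first <= np.quantile(first, 0.10)

print(f"top 10%:    first {first[top].mean():+.2f} -> second {second[top].mean():+.2f}")
print(f"bottom 10%: first {first[bottom].mean():+.2f} -> second {second[bottom].mean():+.2f}")
```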
The top group falls and the bottom group rises automatically. If you'd run a "coaching program" on the bottom 10%, you'd wrongly credit it for the rebound.
How to avoid it. Whenever you select a group because it was extreme and then re-measure, expect a drift toward the mean even if nothing was done. To attribute a change to an intervention, you need a control group that was equally extreme but untreated — only the gap between the groups isolates the real effect.
Regression to the mean masquerades as success
Interventions targeted at the worst cases (failing students, sick patients, underperforming stores) almost always look effective, because those cases would have improved on their own. Without a control group, you can't separate the treatment from the inevitable bounce-back.
Cherry-picking and the Texas sharpshooter
Cherry-picking is reporting only the data points, time windows, or subgroups that support your conclusion and quietly dropping the rest. "Sales are up!" — measured from the one low month you chose as the starting point.
The Texas sharpshooter fallacy is its formal cousin: fire a gun at a barn, then paint the target around the tightest cluster of holes, and declare yourself a marksman. In data terms — find a pattern in the data first, then construct the hypothesis to fit that exact pattern, and present it as if you'd predicted it. The cluster looks meaningful only because you drew the target after seeing where the bullets landed.
Concrete example. A "cancer cluster" near a factory looks alarming — but if you scan thousands of neighborhoods, some will show high counts by chance, and zooming in on the worst one after the fact (drawing the target around it) manufactures a story from noise. The same logic powers "this stock-picking strategy would have crushed the market" when the strategy was tuned on the very history it's tested against.
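The cluster example can be sketched as a simulation. The setup is hypothetical: 5,000 neighborhoods that all share the same true rate (10 expected cases each), with counts drawn from a Poisson distribution so that any differences are chance alone.

```python
import numpy as np

rng = np.random.default_rng(3)
n_neighborhoods, expected_cases = 5_000, 10

# Every neighborhood has the same true rate; counts differ by chance alone.
cases = rng.poisson(lam=expected_cases, size=n_neighborhoods)

worst = cases.max()
n_doubled = int((cases >= 2 * expected_cases).sum())

print(f"expected cases per neighborhood: {expected_cases}")
print(f"worst neighborhood found by scanning: {worst} cases")
print(f"neighborhoods with at least double the expected count: {n_doubled}")
```

With this many neighborhoods, a handful show double the expected count even though no location is riskier than any other. Picking the worst one after scanning and building a story around it is painting the target after the shots.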
How to avoid it. State the hypothesis and the analysis window before looking, report all the comparisons you ran (not just the flattering ones), and confirm any data-discovered pattern on new, independent data. If the target was painted after the shots, the bullseye proves nothing.
The thread connecting half of these
P-hacking, the garden of forking paths, cherry-picking, and the Texas sharpshooter are all variations on one theme: the analysis was shaped by the data instead of decided in advance. The antidote is always the same: decide first, then look, then confirm on fresh data. That discipline is the heart of Exploratory Statistical Analysis.
Practice
A DataFrame df of A/B test results by device has been created. Each row has device, group ("A" or "B"), visitors, and conversions.
Determine whether Simpson's paradox is present by comparing the aggregate winner to the within-device winner.
Produce two variables:
- agg_winner: the string "A" or "B", whichever group has the higher overall conversion rate (pooled across devices).
- within_winner: the string "A" or "B", the group that has the higher conversion rate within both devices (the same group wins on each device in this data).
Conversion rate = conversions / visitors. A paradox exists when agg_winner != within_winner.
Simulate the multiple-comparisons trap. An outcome array (pure noise, no real signal) has been created.
Run 100 independent tests: for each, generate a fresh unrelated random feature of the same length, correlate it with outcome using scipy.stats.pearsonr, and count how many come back "significant" at alpha = 0.05.
- Use the provided rng so results are reproducible.
- Store the count of significant tests (p < 0.05) in an int variable named false_positives.
Since nothing is real, every significant result is a false positive. Expect roughly 5 out of 100.
Check your understanding
Overall, your app's checkout conversion rose from last quarter, but it fell within every individual marketing channel. What happened?
The aggregate number is wrong and should be ignored
The mix of traffic shifted toward higher-converting channels, lifting the overall rate even as each channel declined
Conversion genuinely improved, so the channel breakdown is misleading
This is impossible without a bug
Cities with more police officers per capita tend to have more reported crime. A council member concludes police cause crime. The flaw is:
Survivorship bias
A confounder (city size / population density) drives both police numbers and crime, creating a non-causal correlation
Regression to the mean
Base-rate neglect
A team tests 40 features for an effect on retention at alpha = 0.05, finds 3 with p < 0.05, and reports those 3 as discoveries. What's the danger?
Nothing; p < 0.05 means each is a real effect
With 40 tests, about 2 false positives are expected by chance, so the 3 "hits" may be noise unless corrected and confirmed
The sample was too small
They should have used a one-sided test
A study of successful entrepreneurs finds most dropped out of college, and concludes dropping out boosts success. The fatal flaw is:
Survivorship bias — it only looks at successful people and ignores the far larger number of dropouts who failed
Regression to the mean
Simpson's paradox
P-hacking
A clinic enrolls the patients with the highest blood pressure, treats them, and finds their average pressure dropped at the next visit. Before crediting the treatment, the biggest concern is:
Base-rate neglect
Regression to the mean — the most extreme readings tend to fall toward average on re-measurement even with no treatment
Survivorship bias
Cherry-picking the data
An analyst scans 500 variables, finds one oddly predictive subgroup, and writes a hypothesis specifically describing that subgroup as if it had been predicted in advance. This is:
Confounding
Base-rate neglect
The Texas sharpshooter fallacy — drawing the target around the cluster after seeing where the shots landed
Regression to the mean
Key takeaways
- Simpson's paradox: an aggregate trend can reverse within subgroups — check for an uneven lurking variable.
- Confounding: a third variable can create a correlation with no causal link — randomization breaks it.
- P-hacking / multiple comparisons: run enough tests and chance hands you "significance" — correct and confirm.
- Base-rate neglect: for rare events, even accurate tests yield mostly false positives.
- Survivorship bias: the cases that dropped out carry information you can't see — ask what's missing.
- Regression to the mean: extremes drift back on their own — you need a control group to credit an intervention.
- Cherry-picking / Texas sharpshooter: decide the hypothesis before you look, report everything, and confirm on new data.