ANOVA and Chi-Square
Two essential tests beyond the t-test — one-way ANOVA for comparing the means of three or more groups, and chi-square tests for categorical data (independence and goodness-of-fit), with the intuition, the assumptions, and how to read the results.
The t-test compares two means. But real questions rarely stop at two. Does conversion differ across four landing pages? Do delivery times differ across three warehouses? And what about questions that aren't about means at all — is device type related to whether a user churns? Those are counts of categories, not averages.
This page covers the two tools that handle these cases:
- One-way ANOVA: compares the means of 3+ groups at once.
- Chi-square tests: work with categorical data. Are two categorical variables associated (independence), and do one variable's category counts match an expected pattern (goodness-of-fit)?
Both are one-liners in scipy.stats. As always, the value is in picking
the right one and reading the result honestly.
Part A — One-way ANOVA: comparing 3+ group means
Why not just run a bunch of t-tests?
The tempting move with four groups is to t-test every pair: A vs B, A vs C, A vs D, B vs C, B vs D, C vs D. That is six tests. The problem is the multiplicity of chances to be wrong. Each test at α = 0.05 carries a 5% false-positive risk. Run six and, if the tests were independent, the chance that at least one fires by pure luck would be 1 − 0.95⁶ ≈ 26%, far above 5%.
That inflation is the heart of the multiple comparisons problem — we'll formalize it in Errors and Power. ANOVA's job is to ask the single, global question first — "is there any difference among these groups at all?" — with one test at one α, so your false-positive rate stays where you set it.
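To make the inflation concrete, here is a minimal sketch of the arithmetic. It assumes the pairwise tests are independent, which is not strictly true when pairs share a group, so treat the numbers as a rough guide. The helper name familywise_error_rate is made up for this illustration.

```python
from math import comb


def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests


for k in (3, 4, 5, 10):
    m = comb(k, 2)  # number of distinct pairs among k groups
    rate = familywise_error_rate(m)
    print(f"{k} groups: {m} pairwise tests, about {rate:.0%} chance of a false alarm")
```

With four groups there are six pairs, which is where the six-test example comes from.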
Misconception: just run all the pairwise t-tests
Testing every pair separately quietly inflates your overall false-positive rate above α. With enough groups you're almost guaranteed a spurious "significant" result. Run one ANOVA for the overall question; only if it's significant do you drill into pairs (with a correction). More on controlling this in Errors and Power.
The F-statistic: between-group vs within-group variance
ANOVA's test statistic, F, compares two kinds of spread:
F = (variance BETWEEN group means) / (variance WITHIN groups)
The intuition: if the groups truly have the same mean, then the group averages should differ only as much as the noise inside the groups would predict — so between-spread ≈ within-spread and F ≈ 1. But if some groups are genuinely shifted, the group means spread out more than within-group noise alone can explain, pushing F well above 1.
Same shape as the t-test: a signal (between-group spread) divided by noise (within-group spread), turned into a p-value by the F-distribution.
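The ratio can be computed by hand to see what goes into it. The sketch below uses small made-up samples (not from any real dataset), divides each sum of squares by its degrees of freedom to get the two variance estimates, and checks the result against scipy.stats.f_oneway.

```python
import numpy as np
from scipy import stats

# Hypothetical data, invented purely for illustration.
groups = [
    np.array([4.1, 3.8, 4.5, 4.0, 4.3]),
    np.array([4.9, 5.2, 4.7, 5.0, 5.4]),
    np.array([4.2, 4.4, 3.9, 4.6, 4.1]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total observations
grand_mean = np.concatenate(groups).mean()

# Between-group spread: how far each group mean sits from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group spread: how far each value sits from its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # variance between group means
ms_within = ss_within / (n_total - k)    # variance within groups
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.3f}, scipy F = {f_scipy:.3f}")
```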
Running a one-way ANOVA
A delivery company suspects three warehouses have different average
fulfillment times. One call to scipy.stats.f_oneway answers the global
question.
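A minimal sketch of that call follows. The fulfillment times are invented for illustration (hours per order, eight orders per warehouse); with real data you would pass one array per warehouse in the same way.

```python
from scipy import stats

# Hypothetical fulfillment times in hours, made up for this example.
warehouse_a = [22.1, 24.3, 23.5, 25.0, 23.8, 24.6, 22.9, 24.1]
warehouse_b = [24.8, 23.9, 25.2, 24.4, 23.6, 25.5, 24.0, 24.9]
warehouse_c = [27.2, 26.5, 28.1, 27.8, 26.9, 27.4, 28.3, 26.7]

# One test, one alpha: is there any difference among the three means?
f_stat, p_value = stats.f_oneway(warehouse_a, warehouse_b, warehouse_c)

alpha = 0.05
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")
print("reject H0" if p_value <= alpha else "fail to reject H0")
```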
How to read it. The hypotheses are:
- H₀: all group means are equal (μ₁ = μ₂ = μ₃).
- H₁: at least one group mean differs from the rest.
A large F with a small p-value says the group means are spread out more than within-group noise can explain — so you reject H₀ and conclude some difference exists. Crucially, that is all it tells you.
Misconception: ANOVA tells you WHICH group differs
It does not. A significant ANOVA says "at least one group is different" — it's an omnibus test. To find out which pairs differ, you run post-hoc comparisons (e.g. Tukey's HSD) that correct for multiplicity. Think of ANOVA as the smoke alarm: it tells you there's a fire somewhere, not which room.
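As one possible follow-up, SciPy ships a Tukey HSD routine. The sketch below assumes SciPy 1.8 or newer (where scipy.stats.tukey_hsd is available) and reuses invented fulfillment times; it is an illustration of the post-hoc step, not a prescribed workflow.

```python
from scipy import stats

# Hypothetical fulfillment times in hours, made up for this example.
warehouse_a = [22.1, 24.3, 23.5, 25.0, 23.8, 24.6, 22.9, 24.1]
warehouse_b = [24.8, 23.9, 25.2, 24.4, 23.6, 25.5, 24.0, 24.9]
warehouse_c = [27.2, 26.5, 28.1, 27.8, 26.9, 27.4, 28.3, 26.7]

# Step 1: the omnibus question.
f_stat, p_value = stats.f_oneway(warehouse_a, warehouse_b, warehouse_c)

# Step 2: only if the omnibus test is significant, look at pairs.
if p_value <= 0.05:
    result = stats.tukey_hsd(warehouse_a, warehouse_b, warehouse_c)
    names = ["A", "B", "C"]
    for i in range(3):
        for j in range(i + 1, 3):
            # result.pvalue[i, j] is already adjusted for multiplicity.
            print(f"{names[i]} vs {names[j]}: adjusted p = {result.pvalue[i, j]:.4g}")
```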
ANOVA's assumptions (close cousins of the t-test's)
One-way ANOVA assumes independent observations, approximately normal within-group values (the CLT helps for larger groups), and roughly equal variances across groups (homogeneity of variance). If variances differ a lot, a Welch-style ANOVA is the robust analogue — same spirit as preferring Welch's t-test. If normality is badly violated with small groups, the Kruskal–Wallis test is the nonparametric counterpart (a cousin of the methods in Correlation and Nonparametric Tests).
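Two of these checks are available directly in scipy.stats. The sketch below, on invented data, uses Levene's test as a rough screen for unequal variances and runs Kruskal-Wallis as the rank-based alternative. It does not show a Welch-style ANOVA, which is not covered here.

```python
from scipy import stats

# Hypothetical data, invented for illustration.
group_a = [22.1, 24.3, 23.5, 25.0, 23.8, 24.6, 22.9, 24.1]
group_b = [24.8, 23.9, 25.2, 24.4, 23.6, 25.5, 24.0, 24.9]
group_c = [27.2, 26.5, 28.1, 27.8, 26.9, 27.4, 28.3, 26.7]

# Levene's test: H0 is that all groups share the same variance.
# A small p-value is a warning sign for the equal-variance assumption.
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

# Kruskal-Wallis: compares groups using ranks, so it does not
# rely on within-group normality.
kw_stat, kw_p = stats.kruskal(group_a, group_b, group_c)

print(f"Levene p = {levene_p:.3f}")
print(f"Kruskal-Wallis H = {kw_stat:.2f}, p = {kw_p:.4g}")
```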
A one-way ANOVA across four marketing channels returns F = 6.1, p = 0.0004. What may you correctly conclude?
Channel A has the highest conversion rate
All four channels have different conversion rates from each other
At least one channel's mean conversion rate differs from the others, but ANOVA alone doesn't say which
The differences are large and important
Part B — Chi-square: tests for categorical data
ANOVA and t-tests need numbers (heights, times, dollars). But masses of real data are categorical: device type, subscription tier, yes/no churn, survey response. Chi-square (χ²) tests are built for exactly this. They come in two main flavors.
Test of independence: are two categorical variables related?
The question it answers: in a contingency table (a cross-tab of two categorical variables), are the variables associated, or independent?
The idea is to compare the observed counts with the counts you'd expect if the two variables were completely unrelated. If observed and expected are close, no association; if they diverge a lot, the variables are linked.
Let's test whether device type is associated with churn.
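A minimal sketch of the test, using invented counts for 1,000 users. The column names device and churned are assumptions for this example. The per-user rows are rebuilt from the counts only so that pd.crosstab has something to tabulate.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts: (device, churned) -> number of users.
counts = {
    ("mobile", True): 120, ("mobile", False): 380,
    ("desktop", True): 60, ("desktop", False): 340,
    ("tablet", True): 20, ("tablet", False): 80,
}
# Expand to one row per user, as the data would usually arrive.
rows = [
    {"device": device, "churned": churned}
    for (device, churned), n in counts.items()
    for _ in range(n)
]
df = pd.DataFrame(rows)

# Contingency table of raw counts: devices as rows, churn as columns.
table = pd.crosstab(df["device"], df["churned"])

chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, dof = {dof}")
print(pd.DataFrame(expected, index=table.index, columns=table.columns).round(1))
```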
How to read it. The hypotheses are:
- H₀: the two variables are independent (no association).
- H₁: they are associated (not independent).
chi2_contingency returns four things: the statistic, the p-value, the
degrees of freedom, and the expected counts table. A small p-value
means observed and expected diverge more than chance would allow — so the
variables are related. Notice the expected table: it's what each cell
would look like if device had no bearing on churn. Chi-square is just a
measure of how far reality strays from that "no relationship" world.
Misconception: feed proportions or percentages into chi-square
Chi-square tests work on raw counts, never proportions or
percentages. The test's entire logic rests on how many observations
fall in each cell — 30-out-of-100 carries far less evidence than
3,000-out-of-10,000, even though both are "30%." If you pass in
percentages, the math is meaningless. Always build the table from
counts (which is exactly what pd.crosstab gives you).
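To see why sample size matters, the sketch below runs the same test on two invented tables with identical percentages but different totals. It passes correction=False so the statistic scales exactly with sample size; by default chi2_contingency applies a continuity correction to 2×2 tables.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 tables: rows are groups, columns are yes/no counts.
small = np.array([[30, 70],
                  [20, 80]])      # 100 observations per row
large = small * 100               # 10,000 per row, same percentages

chi2_small, p_small, _, _ = stats.chi2_contingency(small, correction=False)
chi2_large, p_large, _, _ = stats.chi2_contingency(large, correction=False)

print(f"small table: chi2 = {chi2_small:.2f}, p = {p_small:.4f}")
print(f"large table: chi2 = {chi2_large:.2f}, p = {p_large:.3g}")
```

Both tables say "30% vs 20%", yet only the larger one gives strong evidence of an association.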
Chi-square tests EXISTENCE, not STRENGTH
A significant chi-square tells you an association exists — it says nothing about how strong it is. With a huge sample, even a trivial, practically irrelevant association becomes "significant." To measure strength, use an effect size for tables like Cramér's V (a companion to the effect sizes in Effect Sizes). Existence and magnitude are different questions, just as with every other test.
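Cramér's V can be computed from the chi-square statistic as V = sqrt(χ² / (n · (min(rows, cols) − 1))). The helper below is a sketch written for this page (the name cramers_v is not a SciPy function), and the table is invented to show a significant but tiny association.

```python
import numpy as np
from scipy import stats


def cramers_v(table) -> float:
    """Cramér's V for a contingency table of raw counts (0 = none, 1 = perfect)."""
    table = np.asarray(table)
    # No continuity correction, so V follows the textbook formula.
    chi2 = stats.chi2_contingency(table, correction=False)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k)))


# Hypothetical counts: 20,000 observations, 51% vs 49% split.
table = np.array([[5100, 4900],
                  [4900, 5100]])

chi2, p_value, dof, _ = stats.chi2_contingency(table, correction=False)
v = cramers_v(table)
print(f"chi2 = {chi2:.1f}, p = {p_value:.4f}, Cramér's V = {v:.3f}")
```

Here the p-value is well below 0.05 while V is about 0.02, which most conventions would call negligible.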
Goodness-of-fit: does one variable match an expected distribution?
The question it answers: do the observed counts of a single categorical variable match a set of expected proportions?
Classic uses: is a die fair (each face 1/6)? Do website visits split
across days the way we assumed? Did this quarter's support tickets follow
last year's category mix? Here you use stats.chisquare.
The hypotheses: H₀ says the counts follow the expected proportions; H₁ says they don't. Same logic as before — observed vs expected — just for one variable against a reference pattern instead of two variables against each other.
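A minimal sketch for the fair-die case, with invented roll counts. Note that stats.chisquare takes expected counts, not proportions, so the expected proportions are multiplied by the total number of rolls; the observed and expected totals should match.

```python
import numpy as np
from scipy import stats

# Hypothetical tallies for faces 1 through 6 over 120 rolls.
observed = np.array([18, 22, 16, 25, 19, 20])

# Fair die: each face has proportion 1/6, converted to expected counts.
expected_props = np.full(6, 1 / 6)
expected = expected_props * observed.sum()

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)

alpha = 0.05
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
print("reject H0" if p_value <= alpha else "no evidence the die is unfair")
```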
Watch out for small expected counts
The chi-square approximation gets unreliable when expected counts in cells are very small (a common rule of thumb: be wary if any expected count is below 5). With sparse tables — rare categories, small samples — the p-value can be off. Fixes include combining small categories or using Fisher's exact test for 2×2 tables. Note this is about expected counts, not observed ones.
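One way to apply this check is to inspect the expected table that chi2_contingency returns before trusting its p-value. The sketch below uses an invented 2×2 table with few observations and falls back to Fisher's exact test when any expected count is below 5; the threshold is the rule of thumb above, not a hard law.

```python
import numpy as np
from scipy import stats

# Hypothetical sparse 2x2 table of raw counts (21 observations).
table = np.array([[3, 9],
                  [7, 2]])

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
print("expected counts:")
print(expected.round(2))

if (expected < 5).any():
    # Sparse table: use the exact test instead of the approximation.
    odds_ratio, p_exact = stats.fisher_exact(table)
    print(f"Fisher's exact test: p = {p_exact:.4f}")
else:
    print(f"chi-square: p = {p_chi2:.4f}")
```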
You have a table showing the percentage of users in each subscription tier who churned. You want a chi-square test of independence between tier and churn. What must you do first?
Nothing — chi-square works directly on percentages
Convert back to raw counts per cell, because chi-square requires counts, not percentages
Multiply each percentage by 100 to get whole numbers
Use a t-test instead, since percentages are numeric
Challenge 1 — One-way ANOVA across three groups
Three groups of students learned the same material with three different methods. You have each group's test scores. Test whether the average score differs across methods.
- Run a one-way ANOVA across method_a, method_b, and method_c.
- Store the F-statistic as a float F_stat and the p-value as a float p_value.
- Set a string conclusion to "at least one differs" if the result is significant at alpha = 0.05, otherwise "no difference detected".
Use the right scipy.stats function for comparing 3+ group means — do not run separate t-tests.
Challenge 2 — Chi-square test of independence
You have a DataFrame df with two categorical columns: plan ("basic"/"pro"/"enterprise") and churned (True/False). Test whether the two variables are associated.
- Build a contingency table of counts named table using pd.crosstab(df["plan"], df["churned"]).
- Run a chi-square test of independence on that table.
- Store the p-value as a float named p_value.
- Set a boolean associated to whether the variables are associated at alpha = 0.05 (reject H0 when p_value <= alpha).
Remember: chi-square needs counts, which is exactly what crosstab produces.
Common misconceptions, gathered
Five ANOVA and chi-square traps
- Running many t-tests instead of ANOVA. Each extra test adds false-positive risk; ANOVA asks the global question once.
- Thinking a significant ANOVA names the group. It's an omnibus "something differs"; use post-hoc tests to find which.
- Feeding proportions/percentages into chi-square. It needs raw counts — the sample size is part of the evidence.
- Reading chi-square as strength of association. It tests existence, not size; use Cramér's V for magnitude.
- Ignoring small expected counts. Below ~5 per cell, the chi-square approximation wobbles; combine categories or use Fisher's exact test.
Check your understanding
Why prefer a single one-way ANOVA over running every pairwise t-test among five groups?
ANOVA is more powerful for every individual pair than a t-test
t-tests cannot be applied to more than two groups at once
Running many pairwise tests inflates the overall false-positive rate above alpha, while one ANOVA tests the global question at the chosen alpha
ANOVA never requires any assumptions
The F-statistic in one-way ANOVA is essentially which ratio?
Variance between the group means divided by variance within the groups
The largest group mean divided by the smallest group mean
The total sample size divided by the number of groups
The p-value divided by the significance level
A chi-square test of independence between region and product preference gives p = 0.002 on a sample of 50,000 customers. What is the safest interpretation?
Region strongly determines product preference
There is evidence that region and preference are associated, but the test says nothing about how strong that association is
The association is definitely large because the sample is large
Region and preference are independent
Which scenario calls for a chi-square goodness-of-fit test (rather than a test of independence)?
Comparing average revenue across four store locations
Checking whether observed counts of dice rolls (1 through 6) match the equal proportions of a fair die
Testing whether browser type is related to whether a user converts
Measuring the correlation between height and weight
A one-way ANOVA on three groups is significant. A colleague concludes "so group C is the best." What's the issue?
Nothing is wrong; a significant ANOVA identifies the top group
ANOVA only signals that at least one group differs; identifying which group(s) requires post-hoc pairwise comparisons
ANOVA requires exactly two groups, so the test was invalid
The colleague should have used a chi-square test
Key takeaways
What to carry forward
- One-way ANOVA compares 3+ group means with a single test; its statistic F = between-group variance ÷ within-group variance is signal over noise, and a large F (small p) means at least one group differs.
- Don't replace ANOVA with many t-tests — that inflates false positives (Errors and Power). And ANOVA is an omnibus test: it says something differs, not which (use post-hoc tests).
- Chi-square handles categorical data: independence (two variables, via a crosstab + chi2_contingency) and goodness-of-fit (one variable vs expected proportions, via chisquare).
- Chi-square needs counts, not proportions, tests existence not strength (use Cramér's V for magnitude), and gets shaky with small expected counts (< ~5 per cell).
- Match the test to the data type: numeric means → t-test/ANOVA; categorical counts → chi-square.