ANOVA and Chi-Square

Two essential tests beyond the t-test — one-way ANOVA for comparing the means of three or more groups, and chi-square tests for categorical data (independence and goodness-of-fit), with the intuition, the assumptions, and how to read the results.

The t-test compares two means. But real questions rarely stop at two. Does conversion differ across four landing pages? Do delivery times differ across three warehouses? And what about questions that aren't about means at all — is device type related to whether a user churns? Those are counts of categories, not averages.

This page covers the two tools that handle these cases:

One-way ANOVA: compares the means of 3+ groups at once.
Chi-square tests: for categorical data. Are two categorical variables associated (independence)? Do one variable's category counts match an expected pattern (goodness-of-fit)?

Both are one-liners in scipy.stats. As always, the value is in picking the right one and reading the result honestly.

Part A — One-way ANOVA: comparing 3+ group means

Why not just run a bunch of t-tests?

The tempting move with four groups is to t-test every pair: A vs B, A vs C, A vs D, B vs C, B vs D, C vs D. That is six tests, and each one is another chance to be wrong. Each test at α = 0.05 carries a 5% false-positive risk when its null hypothesis is true. Run six and the chance that at least one fires by pure luck climbs well above 5%: about 26% if the six tests were independent (1 − 0.95⁶ ≈ 0.26).

That inflation is the heart of the multiple comparisons problem — we'll formalize it in Errors and Power. ANOVA's job is to ask the single, global question first — "is there any difference among these groups at all?" — with one test at one α, so your false-positive rate stays where you set it.
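To see how quickly the risk grows with the number of groups, here is a small sketch. It assumes the pairwise tests are independent, which real pairwise tests are not (they share groups), so treat the numbers as a rough guide rather than exact rates. The helper name familywise_error_rate is made up for this illustration.

```python
from math import comb

def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n_tests tests.

    Simplifying assumption: the tests are independent.
    """
    return 1 - (1 - alpha) ** n_tests

# Number of pairwise comparisons for k groups is "k choose 2".
for k in (3, 4, 5, 10):
    n_pairs = comb(k, 2)
    rate = familywise_error_rate(n_pairs)
    print(f"{k} groups -> {n_pairs} pairwise tests -> ~{rate:.0%} chance of a false alarm")
```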

Misconception: just run all the pairwise t-tests

Testing every pair separately quietly inflates your overall false-positive rate above α. With enough groups you're almost guaranteed a spurious "significant" result. Run one ANOVA for the overall question; only if it's significant do you drill into pairs (with a correction). More on controlling this in Errors and Power.

The F-statistic: between-group vs within-group variance

ANOVA's test statistic, F, compares two kinds of spread:

F = (variance BETWEEN group means) / (variance WITHIN groups)

The intuition: if the groups truly have the same mean, then the group averages should differ only as much as the noise inside the groups would predict — so between-spread ≈ within-spread and F ≈ 1. But if some groups are genuinely shifted, the group means spread out more than within-group noise alone can explain, pushing F well above 1.

Same shape as the t-test: a signal (between-group spread) divided by noise (within-group spread), turned into a p-value by the F-distribution.
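To make the ratio concrete, the sketch below computes F by hand and checks it against scipy.stats.f_oneway. The three groups are made-up illustrative numbers. "Variance between" here is the between-group mean square (sum of squares divided by k − 1) and "variance within" is the within-group mean square (sum of squares divided by N − k).

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three groups (illustrative numbers only).
groups = [
    np.array([12.0, 15.0, 14.0, 11.0, 13.0]),
    np.array([16.0, 18.0, 17.0, 19.0, 15.0]),
    np.array([13.0, 12.0, 14.0, 15.0, 11.0]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total observations
grand_mean = np.concatenate(groups).mean()

# Signal: how far each group mean sits from the grand mean, weighted by group size.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Noise: how far observations sit from their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
F_manual = ms_between / ms_within

# The p-value is the upper tail of the F-distribution with (k - 1, N - k) degrees of freedom.
p_manual = stats.f.sf(F_manual, k - 1, n_total - k)

F_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {F_manual:.3f}, p = {p_manual:.4g}")
print(f"scipy:  F = {F_scipy:.3f}, p = {p_scipy:.4g}")
```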

Running a one-way ANOVA

A delivery company suspects three warehouses have different average fulfillment times. One call to scipy.stats.f_oneway answers the global question.
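A minimal sketch of that call. The fulfillment times and variable names below are made up for illustration; they are not real data.

```python
from scipy import stats

# Hypothetical fulfillment times in hours for each warehouse (illustrative only).
warehouse_a = [24.1, 26.3, 22.8, 25.0, 23.9, 27.2, 24.6, 25.5]
warehouse_b = [28.4, 30.1, 27.6, 29.3, 31.0, 28.8, 29.9, 27.9]
warehouse_c = [24.9, 23.7, 26.1, 25.4, 24.2, 26.8, 25.1, 23.5]

# One global test: are all three means equal?
F_stat, p_value = stats.f_oneway(warehouse_a, warehouse_b, warehouse_c)
print(f"F = {F_stat:.2f}, p = {p_value:.4g}")
```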

How to read it. The hypotheses are:

H₀: all group means are equal (μ₁ = μ₂ = μ₃).
H₁: at least one group mean differs from the rest.

A large F with a small p-value says the group means are spread out more than within-group noise can explain — so you reject H₀ and conclude some difference exists. Crucially, that is all it tells you.

Misconception: ANOVA tells you WHICH group differs

It does not. A significant ANOVA says "at least one group is different" — it's an omnibus test. To find out which pairs differ, you run post-hoc comparisons (e.g. Tukey's HSD) that correct for multiplicity. Think of ANOVA as the smoke alarm: it tells you there's a fire somewhere, not which room.
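One way to run that post-hoc step is sketched below. It assumes SciPy 1.8 or later, where scipy.stats.tukey_hsd is available, and it uses made-up warehouse times purely for illustration.

```python
from scipy import stats

# Hypothetical fulfillment times in hours (illustrative only).
warehouse_a = [24.1, 26.3, 22.8, 25.0, 23.9, 27.2, 24.6, 25.5]
warehouse_b = [28.4, 30.1, 27.6, 29.3, 31.0, 28.8, 29.9, 27.9]
warehouse_c = [24.9, 23.7, 26.1, 25.4, 24.2, 26.8, 25.1, 23.5]

# Tukey's HSD compares every pair while controlling the overall error rate.
res = stats.tukey_hsd(warehouse_a, warehouse_b, warehouse_c)
print(res)

# res.pvalue[i, j] is the adjusted p-value for group i vs group j
# (0 = A, 1 = B, 2 = C in the order the groups were passed in).
labels = ["A", "B", "C"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"{labels[i]} vs {labels[j]}: adjusted p = {res.pvalue[i, j]:.4g}")
```

In this made-up data, warehouse B was deliberately given slower times, so the pairs involving B are the ones you would expect to stand out.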

ANOVA's assumptions (close cousins of the t-test's)

One-way ANOVA assumes independent observations, approximately normal within-group values (the CLT helps for larger groups), and roughly equal variances across groups (homogeneity of variance). If variances differ a lot, a Welch-style ANOVA is the robust analogue — same spirit as preferring Welch's t-test. If normality is badly violated with small groups, the Kruskal–Wallis test is the nonparametric counterpart (a cousin of the methods in Correlation and Nonparametric Tests).
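A rough sketch of how you might check one assumption and fall back to the nonparametric option. Levene's test is not named in the text above; it is one common choice for checking equal variances, used here as an assumption of this example. The data is made up.

```python
from scipy import stats

# Hypothetical fulfillment times in hours (illustrative only).
warehouse_a = [24.1, 26.3, 22.8, 25.0, 23.9, 27.2, 24.6, 25.5]
warehouse_b = [28.4, 30.1, 27.6, 29.3, 31.0, 28.8, 29.9, 27.9]
warehouse_c = [24.9, 23.7, 26.1, 25.4, 24.2, 26.8, 25.1, 23.5]

# Levene's test: H0 is that all groups have equal variances.
# A small p-value is a warning sign for the equal-variance assumption.
lev_stat, lev_p = stats.levene(warehouse_a, warehouse_b, warehouse_c)
print(f"Levene: W = {lev_stat:.2f}, p = {lev_p:.3f}")

# Kruskal-Wallis: rank-based alternative that does not assume normality.
kw_stat, kw_p = stats.kruskal(warehouse_a, warehouse_b, warehouse_c)
print(f"Kruskal-Wallis: H = {kw_stat:.2f}, p = {kw_p:.4g}")
```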

Question (select one)

A one-way ANOVA across four marketing channels returns F = 6.1, p = 0.0004. What may you correctly conclude?

Channel A has the highest conversion rate

All four channels have different conversion rates from each other

At least one channel's mean conversion rate differs from the others, but ANOVA alone doesn't say which

The differences are large and important

Part B — Chi-square: tests for categorical data

ANOVA and t-tests need numbers (heights, times, dollars). But masses of real data are categorical: device type, subscription tier, yes/no churn, survey response. Chi-square (χ²) tests are built for exactly this. They come in two main flavors.

Test of independence: are two categorical variables associated?

The question it answers: in a contingency table (a cross-tab of two categorical variables), are the variables associated, or independent?

The idea is to compare the observed counts with the counts you'd expect if the two variables were completely unrelated. If observed and expected are close, no association; if they diverge a lot, the variables are linked.

Let's test whether device type is associated with churn.
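A minimal sketch of the test. The device and churn counts below are made up for illustration, and the DataFrame is built from them only so that pd.crosstab has row-level data to count.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts of users by device and churn status (illustrative only).
counts = {
    ("mobile", True): 90, ("mobile", False): 310,
    ("desktop", True): 50, ("desktop", False): 350,
    ("tablet", True): 40, ("tablet", False): 160,
}

# Expand the counts into one row per user, the shape real data usually has.
rows = [
    {"device": device, "churned": churned}
    for (device, churned), n in counts.items()
    for _ in range(n)
]
df = pd.DataFrame(rows)

# Contingency table of raw counts: devices as rows, churn status as columns.
table = pd.crosstab(df["device"], df["churned"])
print(table)

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4g}, dof = {dof}")
print("expected counts under independence:")
print(expected.round(1))
```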

How to read it. The hypotheses are:

H₀: the two variables are independent (no association).
H₁: they are associated (not independent).

chi2_contingency returns four things: the statistic, the p-value, the degrees of freedom, and the expected counts table. A small p-value means observed and expected diverge more than chance alone would plausibly produce, which is evidence that the variables are associated. Notice the expected table: it's what each cell would look like if device had no bearing on churn. Chi-square is just a measure of how far reality strays from that "no relationship" world.
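To see where the expected table comes from, the sketch below rebuilds it by hand: each expected cell is (row total × column total) ÷ grand total, and the statistic sums (observed − expected)² ÷ expected over all cells. The counts are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows are devices, columns are churned / stayed.
observed = np.array([
    [90, 310],
    [50, 350],
    [40, 160],
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 2)
n = observed.sum()

# Expected counts if device and churn were unrelated.
expected_manual = row_totals * col_totals / n

# Chi-square statistic: total squared gap, scaled by the expected count.
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

# correction=False so SciPy computes the same plain statistic.
chi2_scipy, p_value, dof, expected_scipy = stats.chi2_contingency(observed, correction=False)
print(f"manual chi2 = {chi2_manual:.3f}, scipy chi2 = {chi2_scipy:.3f}")
```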

Misconception: feed proportions or percentages into chi-square

Chi-square tests work on raw counts, never proportions or percentages. The test's entire logic rests on how many observations fall in each cell — 30-out-of-100 carries far less evidence than 3,000-out-of-10,000, even though both are "30%." If you pass in percentages, the math is meaningless. Always build the table from counts (which is exactly what pd.crosstab gives you).
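A quick sketch of why sample size matters: two made-up tables with identical proportions (30% vs 20%) but a hundredfold difference in counts.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 tables: rows are groups, columns are yes / no.
small = np.array([[30, 70], [20, 80]])   # 100 observations per row
large = small * 100                      # same proportions, 10,000 per row

_, p_small, _, _ = stats.chi2_contingency(small)
_, p_large, _, _ = stats.chi2_contingency(large)

print(f"small sample: p = {p_small:.3f}")
print(f"large sample: p = {p_large:.3g}")
```

Note that the small table is numerically the same as a table of row percentages. Passing percentages in silently tells the test that each row had exactly 100 observations, whatever the real sample size was.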

Chi-square tests EXISTENCE, not STRENGTH

A significant chi-square tells you an association exists — it says nothing about how strong it is. With a huge sample, even a trivial, practically irrelevant association becomes "significant." To measure strength, use an effect size for tables like Cramér's V (a companion to the effect sizes in Effect Sizes). Existence and magnitude are different questions, just as with every other test.
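A sketch of Cramér's V computed by hand, using the usual formula V = sqrt(χ² / (n × (min(rows, columns) − 1))). The function name cramers_v and the table are made up for this illustration.

```python
import numpy as np
from scipy import stats

def cramers_v(table) -> float:
    """Cramér's V for a contingency table of counts (0 = none, 1 = perfect association)."""
    observed = np.asarray(table)
    # Plain chi-square statistic, without the 2x2 continuity correction.
    chi2 = stats.chi2_contingency(observed, correction=False)[0]
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Hypothetical huge sample with a tiny imbalance (51% vs 49%).
big_table = np.array([[25_500, 24_500], [24_500, 25_500]])

_, p_value, _, _ = stats.chi2_contingency(big_table)
v = cramers_v(big_table)
print(f"p = {p_value:.3g}, Cramér's V = {v:.3f}")
```

With 100,000 observations the p-value is tiny, yet V stays close to zero: the association exists but is weak.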

Goodness-of-fit: does one variable match an expected distribution?

The question it answers: do the observed counts of a single categorical variable match a set of expected proportions?

Classic uses: is a die fair (each face 1/6)? Do website visits split across days the way we assumed? Did this quarter's support tickets follow last year's category mix? Here you use stats.chisquare.

The hypotheses: H₀ says the counts follow the expected proportions; H₁ says they don't. Same logic as before — observed vs expected — just for one variable against a reference pattern instead of two variables against each other.
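A minimal sketch for the fair-die case. The roll counts are made up. Note that the expected values passed to stats.chisquare are counts, not proportions, and should add up to the same total as the observed counts.

```python
import numpy as np
from scipy import stats

# Hypothetical counts of faces 1 through 6 from 120 rolls (illustrative only).
observed = np.array([18, 22, 16, 25, 20, 19])

# A fair die predicts an equal share of the rolls for each face.
expected = np.full(6, observed.sum() / 6)

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
```

If you leave f_exp out, stats.chisquare assumes all categories are equally likely, which is the same test in this case.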

Watch out for small expected counts

The chi-square approximation gets unreliable when expected counts in cells are very small (a common rule of thumb: be wary if any expected count is below 5). With sparse tables — rare categories, small samples — the p-value can be off. Fixes include combining small categories or using Fisher's exact test for 2×2 tables. Note this is about expected counts, not observed ones.
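A sketch of that check on a made-up sparse 2×2 table: inspect the expected counts first, then use Fisher's exact test when they are small.

```python
import numpy as np
from scipy import stats

# Hypothetical sparse 2x2 table of counts (illustrative only).
sparse = np.array([[2, 7], [6, 1]])

_, p_chi2, _, expected = stats.chi2_contingency(sparse)
print("expected counts:")
print(expected)

# The rule of thumb is about EXPECTED counts, not observed ones.
has_small_expected = bool((expected < 5).any())
print("any expected count below 5:", has_small_expected)

# Fisher's exact test does not rely on the chi-square approximation.
odds_ratio, p_fisher = stats.fisher_exact(sparse)
print(f"chi-square p = {p_chi2:.3f}, Fisher exact p = {p_fisher:.3f}")
```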

Question (select one)

You have a table showing the percentage of users in each subscription tier who churned. You want a chi-square test of independence between tier and churn. What must you do first?

Nothing — chi-square works directly on percentages

Convert back to raw counts per cell, because chi-square requires counts, not percentages

Multiply each percentage by 100 to get whole numbers

Use a t-test instead, since percentages are numeric

Challenge 1 — One-way ANOVA across three groups

Three groups of students learned the same material with three different methods. You have each group's test scores. Test whether the average score differs across methods.

Run a one-way ANOVA across method_a, method_b, and method_c.
Store the F-statistic as a float F_stat and the p-value as a float p_value.
Set a string conclusion to "at least one differs" if the result is significant at alpha = 0.05, otherwise "no difference detected".

Use the right scipy.stats function for comparing 3+ group means — do not run separate t-tests.

Challenge 2 — Chi-square test of independence

You have a DataFrame df with two categorical columns: plan ("basic"/"pro"/"enterprise") and churned (True/False). Test whether the two variables are associated.

Build a contingency table of counts named table using pd.crosstab(df["plan"], df["churned"]).
Run a chi-square test of independence on that table.
Store the p-value as a float named p_value.
Set a boolean associated to whether the variables are associated at alpha = 0.05 (reject H₀ when p_value <= alpha).

Remember: chi-square needs counts, which is exactly what crosstab produces.

Common misconceptions, gathered

Five ANOVA and chi-square traps

Running many t-tests instead of ANOVA. Each extra test adds false-positive risk; ANOVA asks the global question once.
Thinking a significant ANOVA names the group. It's an omnibus "something differs"; use post-hoc tests to find which.
Feeding proportions/percentages into chi-square. It needs raw counts — the sample size is part of the evidence.
Reading chi-square as strength of association. It tests existence, not size; use Cramér's V for magnitude.
Ignoring small expected counts. Below ~5 per cell, the chi-square approximation wobbles; combine categories or use Fisher's exact test.

Check your understanding

Question (select one)

Why prefer a single one-way ANOVA over running every pairwise t-test among five groups?

ANOVA is more powerful for every individual pair than a t-test

t-tests cannot be applied to more than two groups at once

Running many pairwise tests inflates the overall false-positive rate above alpha, while one ANOVA tests the global question at the chosen alpha

ANOVA never requires any assumptions

Question (select one)

The F-statistic in one-way ANOVA is essentially which ratio?

Variance between the group means divided by variance within the groups

The largest group mean divided by the smallest group mean

The total sample size divided by the number of groups

The p-value divided by the significance level

Question (select one)

A chi-square test of independence between region and product preference gives p = 0.002 on a sample of 50,000 customers. What is the safest interpretation?

Region strongly determines product preference

There is evidence that region and preference are associated, but the test says nothing about how strong that association is

The association is definitely large because the sample is large

Region and preference are independent

Question (select one)

Which scenario calls for a chi-square goodness-of-fit test (rather than a test of independence)?

Comparing average revenue across four store locations

Checking whether observed counts of dice rolls (1 through 6) match the equal proportions of a fair die

Testing whether browser type is related to whether a user converts

Measuring the correlation between height and weight

Question (select one)

A one-way ANOVA on three groups is significant. A colleague concludes "so group C is the best." What's the issue?

Nothing is wrong; a significant ANOVA identifies the top group

ANOVA only signals that at least one group differs; identifying which group(s) requires post-hoc pairwise comparisons

ANOVA requires exactly two groups, so the test was invalid

The colleague should have used a chi-square test

Key takeaways

What to carry forward

One-way ANOVA compares 3+ group means with a single test; its statistic F = between-group variance ÷ within-group variance is signal over noise, and a large F (small p) means at least one group differs.
Don't replace ANOVA with many t-tests — that inflates false positives (Errors and Power). And ANOVA is an omnibus test: it says something differs, not which (use post-hoc tests).
Chi-square handles categorical data: independence (two variables, via a crosstab + chi2_contingency) and goodness-of-fit (one variable vs expected proportions, via chisquare).
Chi-square needs counts, not proportions, tests existence not strength (use Cramér's V for magnitude), and gets shaky with small expected counts (< ~5 per cell).
Match the test to the data type: numeric means → t-test/ANOVA; categorical counts → chi-square.
