Exploratory Statistical Analysis
A disciplined EDA workflow that uses statistics to understand distributions, relationships, missingness, and outliers — while keeping a hard wall between exploration that generates hypotheses and confirmation that tests them on fresh data.
Before you test anything, you have to look. Exploratory data analysis (EDA) is the open-ended, curious first pass over a fresh dataset: what's in here, how is each variable distributed, what's missing, what relates to what, and what's surprising? You already do a version of this with Pandas. This page adds the statistical discipline — summaries that quantify spread and skew, relationships measured rather than eyeballed, and a clear sense of which findings are real signal versus which are the kind of noise Why Statistics Matters warned you about.
But EDA carries a hidden danger that this page treats as its spine. Exploration generates hypotheses; it must never be mistaken for confirming them. The single most important habit in all of applied statistics is to keep those two activities on opposite sides of a wall.
Exploratory vs confirmatory: the wall you must not cross
There are two fundamentally different modes of data analysis, and conflating them is how analyses go wrong.
- Exploratory (hypothesis-generating): roam the data, slice it many ways, follow your curiosity, look at hundreds of summaries and plots. Output: interesting questions and candidate hypotheses. Nothing here is a conclusion.
- Confirmatory (hypothesis-testing): state a specific, pre-declared hypothesis and test it — ideally on data you have not yet looked at. Output: a defensible claim with a p-value, an effect size, and a confidence interval.
The dotted arrow is the whole game. The hypothesis you discovered in exploration is a lead, not a verdict. To confirm it honestly you need data that did not participate in suggesting it.
The cardinal sin: testing a hypothesis on the data that suggested it
If you scan the data, notice "Tuesday signups convert unusually well," and then run a test on the same data asking whether Tuesday converts better — your p-value is a lie. You already know the answer is "yes" in this dataset; that's why the pattern caught your eye. Testing it here inflates false positives exactly like the multiple-comparisons trap in Statistical Fallacies (the "garden of forking paths"). Confirm on new data, a held-out split, or a fresh time window.
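A small simulation makes the inflation concrete. The sketch below is an illustration under assumed numbers, not part of the original analysis: seven days that all convert at the same true rate (so there is nothing real to find), 200 signups per day, and a hand-rolled two-proportion z-test. The helper names `two_prop_p_value` and `best_day_is_significant` are hypothetical. In each simulated dataset it picks the best-looking day, then tests that day once on the same data and once on a fresh draw.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
N_SIMS, N_DAYS, N_PER_DAY, TRUE_RATE = 2000, 7, 200, 0.10  # assumed values

def two_prop_p_value(x1, n1, x2, n2):
    """Two-sided p-value, two-proportion z-test (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def best_day_is_significant(conversions, day):
    """Test the chosen day against all other days pooled together."""
    rest = conversions.sum() - conversions[day]
    p = two_prop_p_value(conversions[day], N_PER_DAY,
                         rest, N_PER_DAY * (N_DAYS - 1))
    return p < 0.05

same_hits = 0
fresh_hits = 0
for _ in range(N_SIMS):
    explored = rng.binomial(N_PER_DAY, TRUE_RATE, size=N_DAYS)
    best = int(np.argmax(explored))   # the day that "caught your eye"
    same_hits += best_day_is_significant(explored, best)
    # Fresh data: same true rate, but it played no part in choosing `best`.
    fresh = rng.binomial(N_PER_DAY, TRUE_RATE, size=N_DAYS)
    fresh_hits += best_day_is_significant(fresh, best)

same_rate = same_hits / N_SIMS
fresh_rate = fresh_hits / N_SIMS
print(f"false-positive rate, same data:  {same_rate:.3f}")
print(f"false-positive rate, fresh data: {fresh_rate:.3f}")
```

Because no day truly differs, every "significant" result here is a false positive. Under these assumptions the same-data rate should land well above the nominal 5%, while the fresh-data rate should stay close to it.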
A practical workaround: hold out a slice
When you can't collect new data, split it before you start exploring: explore freely on, say, 70% of the rows, and lock away the other 30% untouched. When a hypothesis emerges, test it once on the held-out 30%. That slice never influenced your hypothesis, so the test is honest.
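One way to make that split, sketched with pandas. The DataFrame here is a small synthetic stand-in for tips (made-up values, same column names) so the snippet runs without downloading anything; the 70/30 ratio and the `random_state` values are arbitrary choices.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the tips table (made-up values).
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "total_bill": rng.gamma(shape=2.0, scale=10.0, size=n).round(2),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=n),
})
df["tip"] = (df["total_bill"] * rng.uniform(0.10, 0.22, size=n)).round(2)

# Split BEFORE looking at anything: 70% to explore, 30% locked away.
explore = df.sample(frac=0.7, random_state=0)
holdout = df.drop(explore.index)

print(len(explore), len(holdout))
```

From here on, every plot and summary uses `explore` only. `holdout` is touched once, for the single confirmatory test.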
The EDA pipeline
A disciplined exploration moves through a rough sequence. It's iterative, not linear — findings send you back upstream — but the stages give you a checklist so you don't miss something basic.
We'll walk through this with a real dataset. The tips dataset records restaurant bills, tips, party size, day, and a few categorical traits: small enough to reason about, rich enough to be interesting.
Steps 1-2: the shape and each variable's distribution
Start by orienting yourself: how big is the table, what are the columns,
and what does each one look like on its own. df.describe() gives a
fast statistical snapshot of the numeric columns — center, spread, and
the quartiles that hint at skew.
Read describe() like a statistician, not just a reader of numbers. For
total_bill: is the mean well above the median? That signals right
skew (a few big bills pulling the average up) — the lesson from Shape
and Outliers. Compare the standard deviation to the mean to gauge
relative spread. The 25th-to-75th percentile range tells you where the
bulk of the data lives.
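A minimal sketch of that reading, using a synthetic stand-in for tips (made-up, right-skewed bills) rather than the real file. The names `mean_minus_median`, `relative_spread`, and `iqr` are just illustrative variables.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for tips, so this runs offline.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"total_bill": rng.gamma(shape=2.0, scale=10.0, size=n).round(2)})
df["tip"] = (df["total_bill"] * rng.uniform(0.10, 0.22, size=n)).round(2)

print(df.shape)        # (rows, columns)
print(df.describe())   # count, mean, std, min, quartiles, max

bill = df["total_bill"]
mean_minus_median = bill.mean() - bill.median()   # > 0 hints at right skew
relative_spread = bill.std() / bill.mean()        # std as a fraction of the mean
iqr = bill.quantile(0.75) - bill.quantile(0.25)   # width of the middle 50%

print(f"mean - median: {mean_minus_median:.2f}")
print(f"std / mean:    {relative_spread:.2f}")
print(f"IQR:           {iqr:.2f}")
```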
For categorical columns, describe() won't help — use value_counts
to see the categories and how (im)balanced they are.
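For example, on a tiny made-up sample with two of the categorical columns from tips (the rows are invented for illustration):

```python
import pandas as pd

# Tiny made-up sample, not the real data.
df = pd.DataFrame({
    "day": ["Sat", "Sun", "Sat", "Thur", "Fri", "Sat", "Sun", "Thur"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No", "No", "No"],
})

day_counts = df["day"].value_counts()                 # rows per category
day_shares = df["day"].value_counts(normalize=True)   # same, as proportions
smoker_counts = df["smoker"].value_counts()

print(day_counts)
print(day_shares.round(3))
print(smoker_counts)
```

A category with only a handful of rows (Fri in this sample) is worth noting now: any per-group average computed on it later will be noisy.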
A picture makes the shape obvious. A histogram of total_bill shows the
right skew directly; a box plot summarizes center, spread, and flags
extreme values.
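A sketch of both plots with matplotlib, assuming it is installed. The bills are synthetic (drawn from a right-skewed distribution), and the bin count of 20 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
total_bill = rng.gamma(shape=2.0, scale=10.0, size=200)  # made-up, right-skewed

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(9, 3.5))

# Histogram: the long right tail shows up as a few sparse bars on the right.
ax_hist.hist(total_bill, bins=20)
ax_hist.set_xlabel("total_bill")
ax_hist.set_ylabel("count")
ax_hist.set_title("Histogram")

# Box plot: median line, quartile box, whiskers, and points beyond them.
ax_box.boxplot(total_bill)
ax_box.set_ylabel("total_bill")
ax_box.set_title("Box plot")

fig.tight_layout()
# In a notebook the figure renders automatically; in a script, drop the
# matplotlib.use("Agg") line and call plt.show(), or save with fig.savefig(...).
```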
What you're hunting for in distributions
Shape (symmetric vs skewed), center, spread, and anything weird: suspicious spikes, impossible values, a long tail. Each oddity is a question to chase — "why is there a cluster at exactly zero?" — not something to silently smooth over.
Step 3: missingness
Real datasets have holes. Before any analysis, quantify how much is missing and where, because the pattern of missingness can bias everything downstream. The tips dataset is complete, so we'll inject a realistic gap to demonstrate the checks you'd run on messy data.
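A sketch of the basic checks. The table is a synthetic stand-in for tips, and the gap (blanking the tip on a random 8% of rows) is a hypothetical injection chosen only to have something to count.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for tips (made-up values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "total_bill": rng.gamma(shape=2.0, scale=10.0, size=n).round(2),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=n),
})
df["tip"] = (df["total_bill"] * rng.uniform(0.10, 0.22, size=n)).round(2)

# Inject a gap: blank out the tip on a random 8% of rows (hypothetical choice).
gap_rows = df.sample(frac=0.08, random_state=1).index
df.loc[gap_rows, "tip"] = np.nan

missing_count = df.isna().sum()          # missing cells per column
missing_share = df.isna().mean() * 100   # same, as a percentage of rows

print(pd.DataFrame({"n_missing": missing_count,
                    "pct_missing": missing_share.round(1)}))
```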
Missingness is rarely random
Dropping rows with missing values is only safe if data is missing
completely at random. Often it isn't: maybe high bills are the ones
where the tip went unrecorded, so dropna() would silently bias your
average tip downward. Always ask why a value is missing before
deciding how to handle it — the mechanism matters more than the count.
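The high-bill scenario can be sketched directly. Everything here is assumed for illustration: the data is synthetic, and the mechanism (the tip goes unrecorded on the largest 15% of bills) is invented so that the true answer is known and the bias is visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
full = pd.DataFrame({"total_bill": rng.gamma(shape=2.0, scale=10.0, size=n).round(2)})
full["tip"] = (full["total_bill"] * rng.uniform(0.10, 0.22, size=n)).round(2)

# Assumed mechanism: the tip is unrecorded on the largest 15% of bills.
observed = full.copy()
big_bill = observed["total_bill"] > observed["total_bill"].quantile(0.85)
observed.loc[big_bill, "tip"] = np.nan

# Check 1: do rows with a missing tip look different on another variable?
tip_status = np.where(observed["tip"].isna(), "missing", "recorded")
bill_by_status = (
    observed.assign(tip_status=tip_status)
    .groupby("tip_status")["total_bill"]
    .mean()
)
print(bill_by_status.round(2))

# Check 2: what dropna() reports vs. the truth we happen to know here.
true_mean_tip = full["tip"].mean()
dropna_mean_tip = observed.dropna(subset=["tip"])["tip"].mean()
print(f"true mean tip:   {true_mean_tip:.2f}")
print(f"after dropna():  {dropna_mean_tip:.2f}")
```

On real data you never get Check 2, because the true values are gone. Check 1 is the one you can always run: a clear difference between the "missing" and "recorded" groups is a warning that dropping rows will bias the result.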
Step 4: outliers
Outliers are either data errors to fix or real extremes that carry information, and EDA is where you tell them apart. A common quick rule is the 1.5 × IQR fence: with IQR = Q3 − Q1, any point below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is flagged for inspection (not automatic deletion).
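A worked sketch of the fence on fifteen made-up bills (illustrative values, not rows from tips):

```python
import pandas as pd

# Fifteen made-up bills; the last one is deliberately large.
df = pd.DataFrame({"total_bill": [8.5, 10.3, 12.0, 13.4, 14.8, 15.0, 16.3, 17.9,
                                  18.4, 19.7, 21.0, 22.5, 24.1, 26.9, 48.3]})
bill = df["total_bill"]

q1, q3 = bill.quantile([0.25, 0.75])   # 14.1 and 21.75 for these values
iqr = q3 - q1                          # 7.65
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag, don't delete: keep the rows so they can be inspected.
flagged = df[(bill < lower) | (bill > upper)]
print(f"fence: [{lower:.3f}, {upper:.3f}]")
print(flagged)
```

Only the 48.3 bill falls outside the fence here. Whether it is an error or a large party is a question for the raw record, not for the rule.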
Don't delete outliers reflexively
"Removing outliers" to make a chart prettier is a classic way to lie with data. A high value is only an error if it's actually impossible or mis-entered. A genuine $50 bill from a big party is real signal; deleting it distorts the truth. Investigate first; delete only with a documented reason.
Step 5: relationships
Now look at how variables move together. Two workhorses: groupby
summaries (how a metric differs across categories) and a correlation
matrix (linear association between numeric columns).
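A groupby summary might look like the sketch below. The eight rows are made up for illustration, and grouping by `time` is just one possible slice.

```python
import pandas as pd

# Eight made-up rows with the relevant tips columns.
df = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 25.0, 20.0, 40.0, 50.0, 30.0, 25.0],
    "tip": [2.0, 3.0, 5.0, 3.0, 6.0, 5.0, 6.0, 5.0],
})
df["tip_pct"] = df["tip"] / df["total_bill"] * 100

# Report the group size next to the average: a mean over 3 rows is fragile.
summary = df.groupby("time")["tip_pct"].agg(["count", "mean", "median"])
print(summary.round(2))
```

Including `count` alongside `mean` keeps small groups from masquerading as solid evidence, and comparing `mean` with `median` shows whether one unusual row is driving a group's average.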
A correlation matrix scans all numeric pairs at once. It's the fastest way to spot which variables travel together — and a natural source of hypotheses.
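A sketch on a synthetic stand-in for tips (made-up values, where tip is constructed as a fraction of the bill, so a strong bill-tip correlation is built in by design):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for tips (made-up values).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "total_bill": rng.gamma(shape=2.0, scale=10.0, size=n).round(2),
    "size": rng.integers(1, 7, size=n),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], size=n),
})
df["tip"] = (df["total_bill"] * rng.uniform(0.10, 0.22, size=n)).round(2)

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so only the cells above (or below) the diagonal carry information.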
Correlation matrices are an exploration tool, not a conclusion machine
A correlation matrix invites the multiple-comparisons trap: with k
columns you're staring at k(k − 1)/2 distinct pairs at once (45 pairs
for k = 10), and some will look notable
by chance. Use it to generate leads, never to declare "X and Y are
related" off the back of one cell. And remember from Effect Sizes and
Correlation and Nonparametric Tests: correlation measures linear
association only, and never implies causation.
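To see how easily chance produces a notable-looking cell, the sketch below builds a table of pure noise (columns that are independent by construction) and finds the largest correlation in it. The sizes, 50 rows and 20 columns, are assumed for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols = 50, 20   # assumed sizes
noise = pd.DataFrame(rng.normal(size=(n_rows, n_cols)),
                     columns=[f"x{i}" for i in range(n_cols)])

corr = noise.corr().to_numpy()
upper = corr[np.triu_indices(n_cols, k=1)]   # each distinct pair exactly once
n_pairs = upper.size                         # n_cols * (n_cols - 1) / 2
max_abs_r = np.abs(upper).max()

print(f"pairs scanned: {n_pairs}")
print(f"largest |r| among unrelated columns: {max_abs_r:.2f}")
```

None of these columns is related to any other, yet the strongest cell in the matrix will typically look like a lead. That is the reason a single eye-catching cell is a hypothesis and nothing more.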
Step 6: forming hypotheses (to confirm later)
This is the output of EDA — a short list of specific, testable hypotheses, each tagged "not yet confirmed." From the exploration above, honest candidates might be:
- Tip percentage differs by day of week (weekends vs weekdays).
- Dinner parties tip a different percentage than lunch parties.
- Larger parties tip a lower percentage (an often-cited pattern, still unconfirmed in this data).
Each is a hypothesis you'd take to a confirmatory test — a t-test, an ANOVA, or a chi-square from the Hypothesis Testing section — run once, on data that didn't generate the idea. Until then, they are leads, full stop.
During EDA you notice that, in your dataset, customers who used coupon "SAVE10" churned 12% less. What is the correct status of this finding?
A confirmed result you can report: coupons reduce churn by 12%
A hypothesis to test later on fresh or held-out data, since exploration generated it
Proof that coupons cause lower churn
Meaningless, because EDA findings are always noise
Practice
The tips dataset is loaded as tips. Explore whether tip percentage differs by day.
- Compute tip percentage as tip / total_bill * 100.
- Group by day and compute the mean tip percentage per day.
- Store the result in a pandas Series named tip_pct_by_day, indexed by day, with the mean tip percentage as values.
This grouped summary is a generated hypothesis — "tip % varies by day" — not yet a confirmed conclusion.
Among the numeric columns of tips, find the pair with the strongest positive linear correlation (excluding each column's correlation with itself).
Produce two variables:
- best_pair: a tuple of two column-name strings (col_a, col_b) for the most correlated distinct pair.
- best_r: a Python float, their Pearson correlation.
The pair order within the tuple does not matter for grading. This is the strongest relationship EDA surfaces — a hypothesis worth confirming, not yet a conclusion.
Check your understanding
What is the defining purpose of exploratory data analysis (as opposed to confirmatory)?
To produce final p-values and decisions
To understand the data and generate candidate hypotheses, without treating any finding as confirmed
To prove your hypothesis is correct using the data you have
To clean the data only, with no analysis
You explore a dataset, spot that "users from Region X spend more," and immediately run a t-test on that same dataset comparing Region X to others. Why is the resulting p-value untrustworthy?
Because t-tests can't be used on spending data
Because the sample is always too small for a t-test
Because the hypothesis was chosen because it stood out in this data, so testing it on the same data inflates the false-positive rate
Because p-values are never trustworthy
During EDA you find 6% of rows are missing the income field. What's the most responsible next step?
Immediately drop those rows and proceed
Fill them all with zero so the column is complete
Investigate why income is missing and whether the missingness relates to other variables before choosing how to handle it
Ignore it, since 6% is small
A correlation matrix shows age and income correlate at 0.45 in your data. Which statement is the honest EDA conclusion?
Age causes higher income; report it
The relationship is definitely real and strong
There's a moderate positive association in this sample worth forming into a hypothesis and testing properly later
The correlation proves nothing and should be discarded
Key takeaways
- EDA is hypothesis-generating: understand distributions, missingness, outliers, and relationships, then form candidate questions.
- Keep a wall between exploratory (look freely) and confirmatory (test a pre-stated hypothesis on fresh/held-out data).
- Never test a hypothesis on the same data that suggested it — it inflates false positives (the garden of forking paths).
- Read describe() for skew and spread; use value_counts for balance; quantify missingness and ask why; flag outliers but don't delete reflexively.
- Correlation matrices and groupby summaries surface leads; confirm them later with the tools from Hypothesis Testing.