Exploratory Statistical Analysis

A disciplined EDA workflow that uses statistics to understand distributions, relationships, missingness, and outliers — while keeping a hard wall between exploration that generates hypotheses and confirmation that tests them on fresh data.

Before you test anything, you have to look. Exploratory data analysis (EDA) is the open-ended, curious first pass over a fresh dataset: what's in here, how is each variable distributed, what's missing, what relates to what, and what's surprising? You already do a version of this with Pandas. This page adds the statistical discipline — summaries that quantify spread and skew, relationships measured rather than eyeballed, and a clear sense of which findings are real signal versus which are the kind of noise Why Statistics Matters warned you about.

But EDA carries a hidden danger that this page treats as its spine. Exploration generates hypotheses; it must never be mistaken for confirming them. The single most important habit in all of applied statistics is to keep those two activities on opposite sides of a wall.

Exploratory vs confirmatory: the wall you must not cross

There are two fundamentally different modes of data analysis, and conflating them is how analyses go wrong.

Exploratory (hypothesis-generating): roam the data, slice it many ways, follow your curiosity, look at hundreds of summaries and plots. Output: interesting questions and candidate hypotheses. Nothing here is a conclusion.
Confirmatory (hypothesis-testing): state a specific, pre-declared hypothesis and test it — ideally on data you have not yet looked at. Output: a defensible claim with a p-value, an effect size, and a confidence interval.

The step from exploration to confirmation is the whole game. The hypothesis you discovered in exploration is a lead, not a verdict. To confirm it honestly you need data that did not participate in suggesting it.

The cardinal sin: testing a hypothesis on the data that suggested it

If you scan the data, notice "Tuesday signups convert unusually well," and then run a test on the same data asking whether Tuesday converts better — your p-value is a lie. You already know the answer is "yes" in this dataset; that's why the pattern caught your eye. Testing it here inflates false positives exactly like the multiple-comparisons trap in Statistical Fallacies (the "garden of forking paths"). Confirm on new data, a held-out split, or a fresh time window.

A practical workaround: hold out a slice

When you can't collect new data, split it before you start exploring: explore freely on, say, 70% of the rows, and lock away the other 30% untouched. When a hypothesis emerges, test it once on the held-out 30%. That slice never influenced your hypothesis, so the test is honest.
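A minimal sketch of that split with pandas, assuming the data is already in a DataFrame. The rows, the 70/30 ratio, the seed, and the names explore and holdout are illustrative choices, not part of the tips data:

```python
import pandas as pd

# Hypothetical stand-in for a full dataset: 10 made-up rows with tips-like columns.
df = pd.DataFrame({
    "total_bill": [10.0, 12.5, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 30.0, 48.0],
    "tip":        [1.5, 2.0, 2.5, 2.0, 3.0, 3.5, 3.0, 4.0, 5.0, 9.0],
})

# Split BEFORE looking at anything; a fixed seed makes the split reproducible.
explore = df.sample(frac=0.7, random_state=42)
holdout = df.drop(explore.index)

# From here on, only `explore` is inspected.
# `holdout` stays untouched until a specific hypothesis has been written down.
print(len(explore), len(holdout))
```

Splitting by random rows suits independent observations. If rows are ordered in time, holding out the most recent window may be the more honest choice.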

The EDA pipeline

A disciplined exploration moves through a rough sequence. It's iterative, not linear — findings send you back upstream — but the stages give you a checklist so you don't miss something basic.

We'll walk this with a real dataset. The tips dataset records restaurant bills, tips, party size, day, and a few categorical traits — small enough to reason about, rich enough to be interesting.

Steps 1-2: the shape and each variable's distribution

Start by orienting yourself: how big is the table, what are the columns, and what does each one look like on its own. df.describe() gives a fast statistical snapshot of the numeric columns — center, spread, and the quartiles that hint at skew.

Read describe() like a statistician, not just a reader of numbers. For total_bill: is the mean well above the median? That signals right skew (a few big bills pulling the average up) — the lesson from Shape and Outliers. Compare the standard deviation to the mean to gauge relative spread. The 25th-to-75th percentile range tells you where the bulk of the data lives.
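As a rough sketch of that reading, the snippet below pulls the relevant numbers out of describe(). The ten rows are made up for illustration (they are not the real tips data), and the names iqr and rel_spread are just local helpers:

```python
import pandas as pd

# Made-up rows shaped like the tips data (illustrative values only).
df = pd.DataFrame({
    "total_bill": [10.0, 12.5, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 30.0, 48.0],
    "tip":        [1.5, 2.0, 2.5, 2.0, 3.0, 3.5, 3.0, 4.0, 5.0, 9.0],
    "day":        ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
})

print(df.shape)    # (rows, columns)
print(df.dtypes)   # which columns are numeric, which are text

summary = df["total_bill"].describe()   # count, mean, std, min, 25%, 50%, 75%, max
mean, median = summary["mean"], summary["50%"]
iqr = summary["75%"] - summary["25%"]   # width of the middle half of the data
rel_spread = summary["std"] / mean      # standard deviation relative to the mean

print(summary)
print(f"mean - median = {mean - median:.2f} (positive hints at right skew)")
print(f"IQR = {iqr:.2f}, std/mean = {rel_spread:.2f}")
```

On the real dataset the same lines apply unchanged once df holds the loaded tips table.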

By default, describe() on a mixed table summarizes only the numeric columns. For categorical columns, use value_counts() to see the categories and how (im)balanced they are.
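A small sketch of that check, using a made-up day column rather than the real data:

```python
import pandas as pd

# Illustrative category values only; the real column would come from the dataset.
days = pd.Series(
    ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
    name="day",
)

counts = days.value_counts()                 # rows per category, most common first
shares = days.value_counts(normalize=True)   # same, as fractions of all rows

print(counts)
print(shares.round(2))
```

A category with very few rows (here, Fri) is worth noting early: any per-category average for it will rest on thin evidence.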

A picture makes the shape obvious. A histogram of total_bill shows the right skew directly; a box plot summarizes center, spread, and flags extreme values.
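One way to draw both views side by side is sketched below with matplotlib. The bills are made-up values with one large entry, and the bin count and figure size are arbitrary choices:

```python
import matplotlib
matplotlib.use("Agg")   # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up bills for illustration; the single large value creates a right tail.
bills = pd.Series(
    [10.0, 12.5, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 30.0, 48.0],
    name="total_bill",
)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(9, 3.5))

ax_hist.hist(bills.to_numpy(), bins=8)
ax_hist.set_xlabel("total_bill")
ax_hist.set_ylabel("count")
ax_hist.set_title("Histogram: look for a long right tail")

# By default the whiskers stop at 1.5 x IQR; points beyond them are drawn separately.
box = ax_box.boxplot(bills.to_numpy())
ax_box.set_ylabel("total_bill")
ax_box.set_title("Box plot: median, quartiles, flagged points")

fig.tight_layout()
flagged = box["fliers"][0].get_ydata()   # values drawn beyond the whiskers
print(flagged)
```

In a notebook the figure renders inline if the backend line is removed.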

What you're hunting for in distributions

Shape (symmetric vs skewed), center, spread, and anything weird: suspicious spikes, impossible values, a long tail. Each oddity is a question to chase — "why is there a cluster at exactly zero?" — not something to silently smooth over.

Step 3: missingness

Real datasets have holes. Before any analysis, quantify how much is missing and where, because the pattern of missingness can bias everything downstream. The tips dataset is complete, so we'll inject a realistic gap to demonstrate the checks you'd run on messy data.
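The checks might look like the sketch below. The rows are made up, and the gap (blanking the tip on every Sunday row) is injected purely for demonstration:

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the tips data (illustrative values only).
df = pd.DataFrame({
    "total_bill": [10.0, 12.5, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 30.0, 48.0],
    "tip":        [1.5, 2.0, 2.5, 2.0, 3.0, 3.5, 3.0, 4.0, 5.0, 9.0],
    "day":        ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
})

# Inject a gap for the demo: the tip goes unrecorded on the three Sunday rows.
df.loc[df["day"] == "Sun", "tip"] = np.nan

missing_count = df.isna().sum()    # missing cells per column
missing_share = df.isna().mean()   # fraction missing per column

# Where does the gap sit? Share of missing tips within each day.
missing_by_day = df["tip"].isna().groupby(df["day"]).mean()

print(missing_count)
print(missing_share)
print(missing_by_day)
```

The per-column totals say how much is missing; the per-group breakdown is what reveals that the holes are concentrated rather than scattered.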

Missingness is rarely random

Dropping rows with missing values is only safe if data is missing completely at random. Often it isn't: maybe high bills are the ones where the tip went unrecorded, so dropna() would silently bias your average tip downward. Always ask why a value is missing before deciding how to handle it — the mechanism matters more than the count.
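To see the bias concretely, here is a toy sketch in which the missingness mechanism is assumed: tips go unrecorded on bills of 25 or more. The data and the threshold are invented for the demonstration:

```python
import numpy as np
import pandas as pd

# Made-up complete data, so the "true" average tip is known.
full = pd.DataFrame({
    "total_bill": [10.0, 12.5, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 30.0, 48.0],
    "tip":        [1.5, 2.0, 2.5, 2.0, 3.0, 3.5, 3.0, 4.0, 5.0, 9.0],
})
true_mean_tip = full["tip"].mean()

# Assumed mechanism for the demo: the tip is lost whenever the bill is 25 or more.
observed = full.copy()
observed.loc[observed["total_bill"] >= 25, "tip"] = np.nan

# The naive approach: drop incomplete rows, then average.
naive_mean_tip = observed.dropna(subset=["tip"])["tip"].mean()

# A quick diagnostic: do rows with a missing tip differ on a column we CAN see?
label = np.where(observed["tip"].isna(), "tip missing", "tip present")
bill_by_missing = observed.groupby(label)["total_bill"].mean()

print(f"true mean tip:  {true_mean_tip:.2f}")
print(f"after dropna(): {naive_mean_tip:.2f}")
print(bill_by_missing)
```

In real data the true mean is unknown, so the diagnostic in the last step is the part that transfers: a large gap in total_bill between the two groups is a warning that dropping rows is not neutral.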

Step 4: outliers

Outliers are either data errors to fix or real extremes that carry information, and EDA is where you tell them apart. A common quick rule is the 1.5 × IQR fence: with IQR = Q3 − Q1, any point below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR is flagged for inspection (not automatic deletion).
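A minimal sketch of the fence on made-up bills (the values are illustrative; quantile() uses linear interpolation by default):

```python
import pandas as pd

# Made-up bills for illustration only.
bills = pd.Series(
    [10.0, 12.5, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 30.0, 48.0],
    name="total_bill",
)

q1, q3 = bills.quantile(0.25), bills.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag, don't delete: keep the rows so they can be inspected one by one.
flagged = bills[(bills < lower) | (bills > upper)]

print(f"fences: [{lower:.2f}, {upper:.2f}]")
print(flagged)
```

The result is a list of rows to look at, each of which still needs a human decision about whether it is an error or a genuine extreme.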

Don't delete outliers reflexively

"Removing outliers" to make a chart prettier is a classic way to lie with data. A high value is only an error if it's actually impossible or mis-entered. A genuine \$50 bill from a big party is real signal — deleting it distorts the truth. Investigate first; delete only with a documented reason.

Step 5: relationships

Now look at how variables move together. Two workhorses: groupby summaries (how a metric differs across categories) and a correlation matrix (linear association between numeric columns).
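A groupby summary can be sketched as follows. The rows and the choice of grouping by time are illustrative; reporting the count alongside the mean and median shows how much data each group average rests on:

```python
import pandas as pd

# Made-up rows shaped like the tips data (illustrative values only).
df = pd.DataFrame({
    "tip":  [1.5, 2.0, 2.5, 2.0, 3.0, 3.5, 3.0, 4.0, 5.0, 9.0],
    "time": ["Lunch"] * 4 + ["Dinner"] * 6,
})

# One row per category: group size, mean, and median of the metric.
by_time = df.groupby("time")["tip"].agg(["count", "mean", "median"])
print(by_time)
```

When the mean and median of a group disagree noticeably, as they do for Dinner here, a few large values are probably driving the mean.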

A correlation matrix scans all numeric pairs at once. It's the fastest way to spot which variables travel together — and a natural source of hypotheses.
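A sketch of the scan, again on made-up rows. select_dtypes("number") is used here to keep only numeric columns before calling corr(), which computes Pearson correlation by default:

```python
import pandas as pd

# Made-up rows shaped like the tips data (illustrative values only).
df = pd.DataFrame({
    "total_bill": [10.0, 12.5, 15.0, 16.0, 18.0, 20.0, 22.0, 25.0, 30.0, 48.0],
    "tip":        [1.5, 2.0, 2.5, 2.0, 3.0, 3.5, 3.0, 4.0, 5.0, 9.0],
    "size":       [1, 2, 2, 2, 2, 3, 2, 4, 4, 6],
    "day":        ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
})

# Text columns such as `day` are excluded; the result is a symmetric table
# with 1.0 on the diagonal (each column against itself).
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```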

Correlation matrices are an exploration tool, not a conclusion machine

A correlation matrix invites the multiple-comparisons trap: with k columns you're staring at k(k − 1)/2 distinct pairs at once (45 pairs for just 10 columns), and some will look notable by chance. Use it to generate leads, never to declare "X and Y are related" off the back of one cell. And remember from Effect Sizes and Correlation and Nonparametric Tests: the default Pearson correlation measures linear association only, and correlation never implies causation.
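The chance effect is easy to simulate. The sketch below builds columns of pure, independent noise and counts how many pairs still look correlated. The sizes, the seed, and the 0.28 cutoff are illustrative assumptions (0.28 is roughly where a single test at the 5% level would flag a correlation with 50 rows):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_cols = 50, 20

# Independent random columns: by construction, no pair is truly related.
noise = pd.DataFrame(
    rng.normal(size=(n_rows, n_cols)),
    columns=[f"x{i}" for i in range(n_cols)],
)
corr = noise.corr().to_numpy()

# Keep each distinct pair once: the strict upper triangle of the matrix.
upper = np.triu(np.ones((n_cols, n_cols), dtype=bool), k=1)
pair_r = corr[upper]

n_pairs = pair_r.size                              # n_cols * (n_cols - 1) / 2
n_notable = int((np.abs(pair_r) > 0.28).sum())     # pairs that "look" related

print(f"{n_pairs} pairs scanned, {n_notable} exceed |r| > 0.28 by chance alone")
```

Any pairs counted here are false leads by design, which is why a single eye-catching cell in a real matrix deserves a follow-up test rather than a headline.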

Step 6: forming hypotheses (to confirm later)

This is the output of EDA — a short list of specific, testable hypotheses, each tagged "not yet confirmed." From the exploration above, honest candidates might be:

Tip percentage differs by day of week (weekends vs weekdays).
Dinner parties tip a different percentage than lunch parties.
Larger parties tip a lower percentage (an often-cited pattern, but still unconfirmed here).

Each is a hypothesis you'd take to a confirmatory test — a t-test, an ANOVA, or a chi-square from the Hypothesis Testing section — run once, on data that didn't generate the idea. Until then, they are leads, full stop.
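To make "run once, on data that didn't generate the idea" concrete, here is a hedged sketch of a single confirmatory check on a held-out slice. It uses a permutation test as a stand-in for the t-test named above, only so the example needs nothing beyond NumPy. The two arrays are simulated tip percentages, not real results, and the group names and sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pre-declared hypothesis (written down BEFORE opening the hold-out):
#   mean tip percentage differs between weekend and weekday parties.

# Simulated hold-out tip percentages (hypothetical numbers for illustration).
weekend = rng.normal(loc=16.0, scale=5.0, size=40)
weekday = rng.normal(loc=16.0, scale=5.0, size=30)

observed = weekend.mean() - weekday.mean()
pooled = np.concatenate([weekend, weekday])

# Permutation test: shuffle the group labels many times and see how often
# a difference at least as large as the observed one arises by chance.
n_perm = 5000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:weekend.size].mean() - shuffled[weekend.size:].mean()

# Two-sided p-value; the +1 terms keep it from ever being exactly zero.
p_value = (np.sum(np.abs(perm_diffs) >= abs(observed)) + 1) / (n_perm + 1)

print(f"observed difference = {observed:.2f} points, p = {p_value:.3f}")
```

Whatever the test, the discipline is the same: one hypothesis, one run, and the result is reported as it comes out.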

Question (select one)

During EDA you notice that, in your dataset, customers who used coupon "SAVE10" churned 12% less. What is the correct status of this finding?

A confirmed result you can report: coupons reduce churn by 12%

A hypothesis to test later on fresh or held-out data, since exploration generated it

Proof that coupons cause lower churn

Meaningless, because EDA findings are always noise

Practice

The tips dataset is loaded as tips. Explore whether tip percentage differs by day.

Compute tip percentage as tip / total_bill * 100.
Group by day and compute the mean tip percentage per day.
Store the result in a pandas Series named tip_pct_by_day, indexed by day, with the mean tip percentage as values.

This grouped summary is a generated hypothesis — "tip % varies by day" — not yet a confirmed conclusion.

Among the numeric columns of tips, find the pair with the strongest positive linear correlation (excluding each column's correlation with itself).

Produce two variables:

best_pair: a tuple of two column-name strings (col_a, col_b) for the most correlated distinct pair.
best_r: a Python float, their Pearson correlation.

The pair order within the tuple does not matter for grading. This is the strongest relationship EDA surfaces — a hypothesis worth confirming, not yet a conclusion.

Check your understanding

Question (select one)

What is the defining purpose of exploratory data analysis (as opposed to confirmatory)?

To produce final p-values and decisions

To understand the data and generate candidate hypotheses, without treating any finding as confirmed

To prove your hypothesis is correct using the data you have

To clean the data only, with no analysis

Question (select one)

You explore a dataset, spot that "users from Region X spend more," and immediately run a t-test on that same dataset comparing Region X to others. Why is the resulting p-value untrustworthy?

Because t-tests can't be used on spending data

Because the sample is always too small for a t-test

Because the hypothesis was chosen because it stood out in this data, so testing it on the same data inflates the false-positive rate

Because p-values are never trustworthy

Question (select one)

During EDA you find 6% of rows are missing the income field. What's the most responsible next step?

Immediately drop those rows and proceed

Fill them all with zero so the column is complete

Investigate why income is missing and whether the missingness relates to other variables before choosing how to handle it

Ignore it, since 6% is small

Question (select one)

A correlation matrix shows age and income correlate at 0.45 in your data. Which statement is the honest EDA conclusion?

Age causes higher income; report it

The relationship is definitely real and strong

There's a moderate positive association in this sample worth forming into a hypothesis and testing properly later

The correlation proves nothing and should be discarded

Key takeaways

EDA is hypothesis-generating: understand distributions, missingness, outliers, and relationships, then form candidate questions.
Keep a wall between exploratory (look freely) and confirmatory (test a pre-stated hypothesis on fresh/held-out data).
Never test a hypothesis on the same data that suggested it — it inflates false positives (the garden of forking paths).
Read describe() for skew and spread; use value_counts for balance; quantify missingness and ask why; flag outliers but don't delete reflexively.
Correlation matrices and groupby summaries surface leads — confirm them later with the tools from Hypothesis Testing.
