Capstone: A Full EDA
Put it all together — explore the penguins dataset end to end, from first glance to a communication-ready figure.
Every chapter so far taught one chart family in isolation. Real exploratory data analysis (EDA) does not work one chart at a time — it is a loop. You glance at the raw table, ask a question, draw the chart that answers it, notice something new, and ask the next question. The skill is not knowing twenty chart types; it is knowing which one to reach for at each step, and reading what it reveals before moving on.
This page is a full pass through that loop on the penguins dataset — 344 penguins measured across three species and three islands. We'll go from "what is even in this table?" to a single, polished figure you could drop into a report. Along the way you will write a lot of the code yourself.
The EDA loop, in one breath
Glance at the data, look at one variable at a time, compare groups, study relationships between pairs, then zoom out to all pairs at once — and finally turn your best finding into a figure that communicates. Each step narrows the question for the next.
Step 1 — First glance
Before any chart, look at the table itself. Three quick checks answer "what am I working with?": the shape (how many rows and columns), the head (what the columns and values actually look like), and the missing values (where the holes are). Skipping this step is how people end up plotting a column that is half empty and drawing the wrong conclusion.
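The three checks can be sketched as one small helper. This is a minimal sketch rather than the page's own code cell: the function name `first_glance` is hypothetical, and the commented usage assumes the table comes from seaborn's `load_dataset("penguins")`, which needs network access the first time it runs.

```python
import pandas as pd


def first_glance(df: pd.DataFrame, n: int = 5) -> dict:
    """Collect the three first-glance checks for any DataFrame."""
    return {
        "shape": df.shape,           # (rows, columns)
        "head": df.head(n),          # first n rows, to see real values
        "missing": df.isna().sum(),  # number of NaNs in each column
    }


# Usage, assuming seaborn's bundled loader (downloads on first call):
#   import seaborn as sns
#   penguins = sns.load_dataset("penguins")
#   for name, result in first_glance(penguins).items():
#       print(name, result, sep="\n")
```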
Read the output top to bottom. The shape is (344, 7): 344 penguins,
seven columns. The head shows the columns split cleanly into who the
penguin is (species, island, sex — categorical labels) and how it
was measured (bill_length_mm, bill_depth_mm, flipper_length_mm,
body_mass_g — numbers). That split is the single most useful thing to
notice, because it tells you which charts are even possible: numeric
columns can go on a scatter plot's axes; categorical columns can color or
split it.
The missing-value count shows a handful of NaNs — two penguins are
missing every measurement, and a few more are missing sex. This is
normal, and Seaborn quietly drops missing rows per plot, so we can keep
moving. But now we know the holes are there instead of being surprised
later.
Categorical vs numeric is the first fork
The moment you can sort a table's columns into "labels" and "numbers" you
already know your options: a histogram or box plot for one numeric column,
a scatter for two numeric columns, hue/col for the categorical ones.
The data types drive the chart choice — not the other way around.
Step 2 — One variable at a time
With the lay of the land clear, start narrow: look at a single numeric
column on its own. A histogram (via displot with kind="hist")
turns one column of numbers into a shape — where values pile up, how spread
out they are, whether there is more than one peak.
Here is body_mass_g, the penguins' weight:
The shape is not a single tidy hump. There is a big cluster of lighter penguins on the left and a smaller, heavier group trailing off to the right — a hint that more than one kind of penguin is mixed into this column. That hint is exactly the kind of thing EDA exists to surface: a single-variable plot raised a question ("who are the heavy ones?") that the next steps will answer.
Now you try the same move on a different column — and split it by species so the groups show up in color.
Using the penguins dataset, draw a histogram of
flipper_length_mm with sns.displot:
- put flipper_length_mm on the x-axis,
- use kind="hist",
- color the bars by species with hue="species".
Assign the result to a variable named g.
Running your version, the colors tell the story the body-mass histogram only hinted at: the right-hand group is almost entirely one species. A single variable, split by a category, already starts separating the penguins into kinds.
Step 3 — Comparing groups
The histogram suggested the species differ. The natural next question is
by how much, and on which measurements? When one axis is a category
and the other is a number, that is a job for catplot. A
box plot (kind="box") summarizes each group's spread: the box is the
middle 50%, the line is the median, and the whiskers by default reach the
farthest points within 1.5 times the box's height, with anything beyond
drawn as individual outlier points. That lets you line the groups up and
compare centers and overlap at a glance.
Body mass by species:
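A minimal sketch of the grouped comparison, assuming the same penguins table as before. The wrapper name `plot_mass_by_species` and its `kind` argument are illustrative choices, not the page's original cell.

```python
import pandas as pd
import seaborn as sns


def plot_mass_by_species(df: pd.DataFrame, kind: str = "box") -> sns.FacetGrid:
    """Compare body mass across species with a categorical plot."""
    # Category on x, number on y; pass kind="violin" for density shapes.
    return sns.catplot(data=df, x="species", y="body_mass_g", kind=kind)


# Usage, assuming the table was loaded with sns.load_dataset("penguins"):
#   g = plot_mass_by_species(penguins)
```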
Now the suspicion is confirmed and quantified: Gentoo penguins are
clearly heavier than Adelie and Chinstrap, whose boxes sit lower
and overlap each other heavily. The box plot earns its keep here — three
distributions compared in one glance, with medians and spread visible at
once. (Swap kind="box" for kind="violin" to see each group's full
density shape instead of a five-number summary.)
The other categorical column we noticed in Step 1 was island. Does weight
vary by where a penguin lives, the way it varies by species? Find out.
Using the penguins dataset, draw a categorical plot with
sns.catplot comparing body_mass_g across islands:
- put island on the x-axis,
- put body_mass_g on the y-axis,
- use kind="box" (or kind="violin" if you prefer).
Assign the result to a variable named g.
Your island plot shows that Biscoe runs heavier than the others — but be careful with the why. Islands are not equally stocked with species (Gentoo live only on Biscoe), so "Biscoe penguins are heavy" is really "Gentoo live on Biscoe, and Gentoo are heavy." A grouping variable can be a stand-in for a different cause. EDA is as much about catching these confounds as about finding patterns.
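One quick way to check for this kind of confound is to count how the two grouping columns overlap. The sketch below uses pandas' `crosstab`; the helper name `species_by_island` is hypothetical.

```python
import pandas as pd


def species_by_island(df: pd.DataFrame) -> pd.DataFrame:
    """Count penguins for every island/species combination."""
    # Rows are islands, columns are species; a zero cell means that
    # species was never recorded on that island.
    return pd.crosstab(df["island"], df["species"])


# Usage, assuming the table was loaded with sns.load_dataset("penguins"):
#   print(species_by_island(penguins))
```

If the paragraph above holds, the Gentoo column of that table should be zero on every island except Biscoe.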
Step 4 — Relationships between pairs
Single variables and group comparisons are warm-ups. The richest questions
in this dataset are about how two measurements move together. Back to the
scatter plot (relplot), now armed with what we know — color by
species so the structure is unmistakable.
Flipper length against body mass:
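A minimal sketch of that scatter, assuming the same penguins table. The wrapper name `plot_flipper_vs_mass` is hypothetical.

```python
import pandas as pd
import seaborn as sns


def plot_flipper_vs_mass(df: pd.DataFrame) -> sns.FacetGrid:
    """Scatter flipper length against body mass, colored by species."""
    return sns.relplot(
        data=df,
        x="flipper_length_mm",
        y="body_mass_g",
        hue="species",  # one color per species
    )


# Usage, assuming the table was loaded with sns.load_dataset("penguins"):
#   g = plot_flipper_vs_mass(penguins)
```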
This is one of the cleanest relationships you will ever see: longer
flippers go with heavier bodies, tightly and almost linearly, and the three
species fall into their own regions of the cloud. Gentoo sit up and to the
right (long flippers, heavy); Adelie and Chinstrap overlap in the
lower-left. The scatter plot shows the relationship; hue shows that the
relationship is shared across groups that nonetheless occupy different
parts of it.
When a relationship looks linear, it is natural to ask for the line.
lmplot draws a scatter plus a fitted regression line per group — a quick
way to compare trends. Try it on the two bill measurements, which (you may
recall from earlier) hide a surprise when you split by species.
Using the penguins dataset, draw a regression plot with
sns.lmplot:
- put bill_length_mm on the x-axis,
- put bill_depth_mm on the y-axis,
- color (and fit a separate line) by species with hue="species".
Assign the result to a variable named g.
Look at the three fitted lines. Within each species, longer bills are
also deeper: every line slopes upward. Yet if you erased the colors
and fitted one line to the whole cloud, the trend would look flat or even
downward, because the Gentoo cluster (long, shallow bills) sits below the
other two species and drags the pooled line down. That reversal, a
within-group trend that vanishes or flips when groups are pooled, is
Simpson's paradox, and lmplot with hue makes it visible in a single chart.
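The same reversal can be checked with numbers instead of lines. This is a sketch under the assumption that a Pearson correlation is an adequate stand-in for the slope's sign; the helper name `pooled_and_within_corr` and its default column names are illustrative.

```python
import pandas as pd


def pooled_and_within_corr(
    df: pd.DataFrame,
    x: str = "bill_length_mm",
    y: str = "bill_depth_mm",
    group: str = "species",
):
    """Return (pooled correlation, Series of per-group correlations)."""
    # One correlation over every row, ignoring the groups entirely.
    pooled = df[x].corr(df[y])
    # One correlation per group, computed on that group's rows only.
    within = pd.Series(
        {name: sub[x].corr(sub[y]) for name, sub in df.groupby(group)}
    )
    return pooled, within


# Usage, assuming the table was loaded with sns.load_dataset("penguins"):
#   pooled, within = pooled_and_within_corr(penguins)
```

Simpson's paradox shows up as a pooled value whose sign disagrees with the per-group values.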
Step 5 — All pairs at once
You have been picking pairs of columns by hand. With only four numeric columns there are not that many pairs — so why not look at all of them at once? A pair plot draws every numeric column against every other in a grid, with each variable's own distribution down the diagonal. It is the fastest way to scan an entire numeric dataset for structure.
In one figure you can confirm everything the earlier steps found and spot things you had not looked for: flipper-vs-mass is tight and linear; the bill measurements separate the species into neat clusters; and the diagonal shows how each measurement alone is split by species. A pair plot is busy, but for a handful of numeric columns it is an unbeatable overview.
A complementary view of "everything at once" is the correlation matrix:
a single number per pair of columns summarizing how strongly they move
together, from -1 (perfect opposite) through 0 (no linear relation) to
+1 (perfect together). A heatmap turns that matrix into color.
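A minimal sketch of such a heatmap, assuming the same penguins table. The helper name `plot_corr_heatmap` is hypothetical, and printing the numbers in each cell with `annot=True` is an assumption added for readability.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_corr_heatmap(df: pd.DataFrame):
    """Return (correlation matrix, Axes holding its heatmap)."""
    # numeric_only=True skips the categorical columns.
    corr = df.corr(numeric_only=True)
    fig, ax = plt.subplots()
    sns.heatmap(
        corr,
        cmap="vlag",  # diverging palette
        center=0,     # pin the neutral color to zero correlation
        annot=True,   # assumption: show each value inside its cell
        fmt=".2f",
        ax=ax,
    )
    return corr, ax


# Usage, assuming the table was loaded with sns.load_dataset("penguins"):
#   corr, ax = plot_corr_heatmap(penguins)
```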
Two choices make this heatmap honest. The palette "vlag" is diverging
— it runs from one color through white to another — and center=0 pins
white to zero, so positive correlations read as one hue and negative ones
as the opposite, with strength shown by intensity. A sequential palette, or
forgetting center=0, would make a +0.1 and a -0.1 look misleadingly
different in size. The bright cell between flipper length and body mass
(near +0.87) is the strong relationship you already saw as a scatter — now
quantified.
That corr matrix has structure worth understanding directly, not just
coloring. Build it yourself and check two facts that are true of every
correlation matrix.
A correlation matrix is square (one row and one column per
numeric variable) and every variable correlates perfectly with itself, so
its diagonal is all 1.
Using the penguins dataset, build the correlation matrix of its numeric
columns and assign it to a variable named corr:
corr = penguins.corr(numeric_only=True)
The tests will confirm that corr is square and that its top-left entry
(a variable's correlation with itself) is 1.
The diagonal being exactly 1 and the matrix being square are not
coincidences — they fall out of what correlation is. Knowing that lets
you read a heatmap critically: the diagonal carries no information (it is
always the most saturated, "perfect" color), so your eye should go straight to
the off-diagonal cells where the real relationships live.
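To read those off-diagonal cells without the color, you can list each pair once, strongest first. This is a sketch under stated assumptions: the helper name `strongest_pairs` is hypothetical, and it expects a matrix like the `corr` built in the exercise above.

```python
import numpy as np
import pandas as pd


def strongest_pairs(corr: pd.DataFrame) -> pd.Series:
    """Return each off-diagonal pair once, sorted by absolute correlation."""
    values = corr.to_numpy()
    # The two facts every correlation matrix satisfies.
    assert values.shape[0] == values.shape[1], "matrix must be square"
    assert np.allclose(np.diag(values), 1.0), "diagonal must be all 1"
    # Keep only cells strictly above the diagonal so no pair repeats.
    upper = np.triu(np.ones(values.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=abs, ascending=False)


# Usage, assuming corr = penguins.corr(numeric_only=True):
#   print(strongest_pairs(corr))
```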
Step 6 — Communicate the finding
Exploration is for you; a final figure is for someone else. The last move of an EDA is to take the single most important thing you found and draw it so a reader gets it in seconds — with a message-style title that states the takeaway, not just the variable names.
Our cleanest finding was that the species separate by bill shape. Here is that scatter, dressed for an audience: a title that says what the chart means, and clear axis labels.
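A minimal sketch of how that figure might be dressed up, assuming the same penguins table. The helper name `plot_bill_shape_figure` and the exact label wording are illustrative; the title text is the message-style example this page uses.

```python
import pandas as pd
import seaborn as sns


def plot_bill_shape_figure(df: pd.DataFrame) -> sns.FacetGrid:
    """Scatter bill length against bill depth with audience-ready text."""
    g = sns.relplot(
        data=df, x="bill_length_mm", y="bill_depth_mm", hue="species"
    )
    # Replace the raw column names with plain-English labels.
    g.set_axis_labels("Bill length (mm)", "Bill depth (mm)")
    # State the takeaway, not the variable names.
    g.ax.set_title("Penguin species separate cleanly by bill shape")
    return g


# Usage, assuming the table was loaded with sns.load_dataset("penguins"):
#   g = plot_bill_shape_figure(penguins)
```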
Compare this to the bare bill plot from Step 4. The data is identical; what changed is that a reader no longer has to derive the point: the title hands it to them, and the labels read in plain English. That is the difference between an exploratory figure (terse, fast, for yourself) and an explanatory one (titled, labeled, for an audience). Both are legitimate; knowing which you are making is the skill.
Title with the verb, not the nouns
"bill_length_mm vs bill_depth_mm" describes the axes. "Penguin species
separate cleanly by bill shape" describes the finding. For a figure meant
to persuade or inform, prefer the sentence with a verb in it — it tells the
reader what to take away before they finish reading the axes.
Check your understanding
You open a brand-new dataset you have never seen. Which step makes the most sense first, before drawing any chart?
Fit a regression line to the two columns whose names sound related.
Inspect .shape, .head(), and .isna().sum() to learn the size, the columns, and where values are missing.
Immediately draw a pair plot of every column against every other.
Pick a color palette so your charts look polished from the start.
You want to compare the body_mass_g distribution across the three penguin
species and read each group's median and spread at a glance. Which chart
fits the question best?
A scatter plot of body_mass_g against flipper_length_mm.
A single histogram of body_mass_g with no grouping.
A box (or violin) plot with species on one axis and body_mass_g on the other.
A correlation heatmap of the numeric columns.
On a correlation heatmap, why is it important to use a diverging
palette with center=0 (for example cmap="vlag", center=0)?
It makes the figure load faster.
It pins a neutral color to zero so positive and negative correlations read as opposite hues, with strength shown by intensity.
It forces every correlation to fall between 0 and 1.
It hides the diagonal automatically.
You have now run a complete EDA: you sized up the table, examined single variables, compared groups, studied relationships, scanned all pairs at once, and turned your best finding into a figure built to communicate. That loop — glance, question, chart, notice, repeat — is the whole job, and you can now run it on any tidy dataset. The final page steps back to recap the mental models you built and points you toward where to take them next.