Capstone: A Full EDA

Put it all together — explore the penguins dataset end to end, from first glance to a communication-ready figure.

Every chapter so far taught one chart family in isolation. Real exploratory data analysis (EDA) does not work one chart at a time — it is a loop. You glance at the raw table, ask a question, draw the chart that answers it, notice something new, and ask the next question. The skill is not knowing twenty chart types; it is knowing which one to reach for at each step, and reading what it reveals before moving on.

This page is a full pass through that loop on the penguins dataset — 344 penguins measured across three species and three islands. We'll go from "what is even in this table?" to a single, polished figure you could drop into a report. Along the way you will write a lot of the code yourself.

The EDA loop, in one breath

Glance at the data, look at one variable at a time, compare groups, study relationships between pairs, then zoom out to all pairs at once — and finally turn your best finding into a figure that communicates. Each step narrows the question for the next.

Step 1 — First glance

Before any chart, look at the table itself. Three quick checks answer "what am I working with?": the shape (how many rows and columns), the head (what the columns and values actually look like), and the missing values (where the holes are). Skipping this step is how people end up plotting a column that is half empty and drawing the wrong conclusion.
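The three checks can be sketched as one small helper. This is a minimal sketch, assuming the table is a pandas DataFrame; the function name `first_glance` is illustrative and does not come from the page.

```python
import pandas as pd


def first_glance(df: pd.DataFrame) -> pd.Series:
    """Print the three first-glance checks and return the missing-value counts."""
    print("shape:", df.shape)   # (number of rows, number of columns)
    print(df.head())            # first five rows: column names and sample values
    missing = df.isna().sum()   # count of missing values in each column
    print(missing)
    return missing
```

With Seaborn installed, `first_glance(sns.load_dataset("penguins"))` runs the checks on the real table. `load_dataset` fetches the file the first time it is called, so that call needs network access.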

Read the output top to bottom. The shape is (344, 7): 344 penguins, seven columns. The head shows the columns split cleanly into who the penguin is (species, island, sex — categorical labels) and how it was measured (bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g — numbers). That split is the single most useful thing to notice, because it tells you which charts are even possible: numeric columns can go on a scatter plot's axes; categorical columns can color or split it.

The missing-value count shows a handful of NaNs: two penguins are missing all four measurements, and 11 are missing sex (those two plus nine others). This is normal, and Seaborn quietly drops rows with missing values in the plotted columns, so we can keep moving. But now we know the holes are there instead of being surprised later.

Categorical vs numeric is the first fork

The moment you can sort a table's columns into "labels" and "numbers" you already know your options: a histogram or box plot for one numeric column, a scatter for two numeric columns, hue/col for the categorical ones. The data types drive the chart choice — not the other way around.

Step 2 — One variable at a time

With the lay of the land clear, start narrow: look at a single numeric column on its own. A histogram (via displot with kind="hist") turns one column of numbers into a shape — where values pile up, how spread out they are, whether there is more than one peak.

Here is body_mass_g, the penguins' weight:

The shape is not a single tidy hump. There is a big cluster of lighter penguins on the left and a smaller, heavier group trailing off to the right — a hint that more than one kind of penguin is mixed into this column. That hint is exactly the kind of thing EDA exists to surface: a single-variable plot raised a question ("who are the heavy ones?") that the next steps will answer.

Now you try the same move on a different column — and split it by species so the groups show up in color.

Using the penguins dataset, draw a histogram of flipper_length_mm with sns.displot:

put flipper_length_mm on the x-axis,
use kind="hist",
color the bars by species with hue="species".

Assign the result to a variable named g.

When you run your version, the colors tell the story the body-mass histogram only hinted at: the right-hand group is almost entirely one species. A single variable, split by a category, already starts separating the penguins into kinds.

Step 3 — Comparing groups

The histogram suggested the species differ. The natural next question is by how much, and on which measurements? When one axis is a category and the other is a number, that is a job for catplot. A box plot (kind="box") summarizes each group's spread: the box is the middle 50% of the values, the line inside it is the median, and the whiskers extend to the most extreme points within 1.5 times the box's height (the interquartile range), with anything beyond drawn as individual outlier points. That lets you line the groups up and compare centers and overlap at a glance.

Body mass by species:
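A minimal sketch of that comparison, assuming a pandas DataFrame with `species` and `body_mass_g` columns. The function name `plot_mass_by_species` and its `kind` parameter are illustrative conveniences, not part of the page.

```python
import pandas as pd
import seaborn as sns


def plot_mass_by_species(df: pd.DataFrame, kind: str = "box") -> sns.FacetGrid:
    """Compare body_mass_g across species with a categorical plot."""
    # The category goes on x and the number on y; kind picks the summary style.
    return sns.catplot(data=df, x="species", y="body_mass_g", kind=kind)
```

Passing `kind="violin"` to the same helper draws the density-shaped variant instead of boxes.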

Now the suspicion is confirmed and quantified: Gentoo penguins are clearly heavier than Adelie and Chinstrap, whose boxes sit lower and overlap each other heavily. The box plot earns its keep here — three distributions compared in one glance, with medians and spread visible at once. (Swap kind="box" for kind="violin" to see each group's full density shape instead of a five-number summary.)

The other categorical column we noticed in Step 1 was island. Does weight vary by where a penguin lives, the way it varies by species? Find out.

Using the penguins dataset, draw a categorical plot with sns.catplot comparing body_mass_g across islands:

put island on the x-axis,
put body_mass_g on the y-axis,
use kind="box" (or kind="violin" if you prefer).

Assign the result to a variable named g.

Your island plot shows that Biscoe runs heavier than the others — but be careful with the why. Islands are not equally stocked with species (Gentoo live only on Biscoe), so "Biscoe penguins are heavy" is really "Gentoo live on Biscoe, and Gentoo are heavy." A grouping variable can be a stand-in for a different cause. EDA is as much about catching these confounds as about finding patterns.
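One quick way to check for this kind of confound is to count penguins for every island and species combination. The sketch below assumes a pandas DataFrame with `island` and `species` columns; the helper name `species_by_island` is illustrative.

```python
import pandas as pd


def species_by_island(df: pd.DataFrame) -> pd.DataFrame:
    """Count rows for each (island, species) combination."""
    # Rows are islands, columns are species; absent combinations show as 0.
    return pd.crosstab(df["island"], df["species"])
```

A column with a single nonzero cell means that species appears on only one island, which is the signal that an island effect may really be a species effect.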

Step 4 — Relationships between pairs

Single variables and group comparisons are warm-ups. The richest questions in this dataset are about how two measurements move together. Back to the scatter plot (relplot), now armed with what we know — color by species so the structure is unmistakable.

Flipper length against body mass:
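A minimal sketch of that scatter, assuming a pandas DataFrame with the column names from Step 1. The wrapper `plot_flipper_vs_mass` is an illustrative name; `relplot` draws a scatter plot by default.

```python
import pandas as pd
import seaborn as sns


def plot_flipper_vs_mass(df: pd.DataFrame) -> sns.FacetGrid:
    """Scatter flipper length against body mass, colored by species."""
    return sns.relplot(
        data=df,
        x="flipper_length_mm",
        y="body_mass_g",
        hue="species",  # one color per species
    )
```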

This is one of the cleanest relationships you will ever see: longer flippers go with heavier bodies, tightly and almost linearly, and the three species fall into their own regions of the cloud. Gentoo sit up and to the right (long flippers, heavy); Adelie and Chinstrap overlap in the lower-left. The scatter plot shows the relationship; hue shows that the relationship is shared across groups that nonetheless occupy different parts of it.

When a relationship looks linear, it is natural to ask for the line. lmplot draws a scatter plus a fitted regression line per group — a quick way to compare trends. Try it on the two bill measurements, which (you may recall from earlier) hide a surprise when you split by species.

Using the penguins dataset, draw a regression plot with sns.lmplot:

put bill_length_mm on the x-axis,
put bill_depth_mm on the y-axis,
color (and fit a separate line) by species with hue="species".

Assign the result to a variable named g.

Look at the three fitted lines. Within each species, longer bills are also deeper — every line slopes upward. Yet if you erased the colors and fitted one line to the whole cloud, the trend would look flat or even downward, because Gentoo (long, shallow bills) pull the overall picture sideways. That reversal — a within-group trend that vanishes or flips when groups are pooled — is Simpson's paradox, and lmplot with hue makes it visible in a single chart.
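The same reversal can be checked numerically by comparing the pooled correlation with the per-species correlations. This is a sketch under the assumption that the table is a pandas DataFrame; the function name `pooled_and_within_corr` and its parameters are illustrative, and Pearson correlation is used here as a stand-in for the slope of each fitted line.

```python
import pandas as pd


def pooled_and_within_corr(
    df: pd.DataFrame,
    x: str = "bill_length_mm",
    y: str = "bill_depth_mm",
    group: str = "species",
):
    """Return (pooled correlation, Series of per-group correlations)."""
    # Series.corr computes Pearson correlation and skips rows with missing values.
    pooled = df[x].corr(df[y])
    within = pd.Series(
        {name: part[x].corr(part[y]) for name, part in df.groupby(group)}
    )
    return pooled, within
```

If the pooled value and the per-group values have opposite signs, the grouping variable is reversing the trend, which is the pattern described above.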

Step 5 — All pairs at once

You have been picking pairs of columns by hand. With only four numeric columns there are just six distinct pairs, so why not look at all of them at once? A pair plot draws every numeric column against every other in a grid, with each variable's own distribution down the diagonal. It is the fastest way to scan an entire numeric dataset for structure.

In one figure you can confirm everything the earlier steps found and spot things you had not looked for: flipper-vs-mass is tight and linear; the bill measurements separate the species into neat clusters; and the diagonal shows how each measurement alone is split by species. A pair plot is busy, but for a handful of numeric columns it is an unbeatable overview.

A complementary view of "everything at once" is the correlation matrix: a single number per pair of columns summarizing how strongly they move together, from -1 (perfect opposite) through 0 (no linear relation) to +1 (perfect together). A heatmap turns that matrix into color.
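A minimal sketch of that heatmap, assuming a pandas DataFrame whose numeric columns are the measurements. The helper name `plot_corr_heatmap` is illustrative, and `annot=True` is an optional addition made here to print each coefficient in its cell.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_corr_heatmap(df: pd.DataFrame):
    """Build the correlation matrix of the numeric columns and draw it."""
    corr = df.corr(numeric_only=True)  # Pearson correlation, numeric columns only
    fig, ax = plt.subplots()
    sns.heatmap(
        corr,
        cmap="vlag",   # diverging palette
        center=0,      # pin the neutral color to zero
        annot=True,    # optional: write the coefficient in each cell
        ax=ax,
    )
    return corr, ax
```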

Two choices make this heatmap honest. The palette "vlag" is diverging — it runs from one color through white to another — and center=0 pins white to zero, so positive correlations read as one hue and negative ones as the opposite, with strength shown by intensity. A sequential palette, or forgetting center=0, would make a +0.1 and a -0.1 look misleadingly different in size. The bright cell between flipper length and body mass (near +0.87) is the strong relationship you already saw as a scatter — now quantified.

That corr matrix has structure worth understanding directly, not just coloring. Build it yourself and check two facts that are true of every correlation matrix.

A correlation matrix is square (one row and one column per numeric variable), and every variable correlates perfectly with itself, so its diagonal is all 1s.

Using the penguins dataset, build the correlation matrix of its numeric columns and assign it to a variable named corr:

corr = penguins.corr(numeric_only=True)

The tests will confirm that corr is square and that its top-left entry (a variable's correlation with itself) is 1.

The diagonal being exactly 1 and the matrix being square are not coincidences — they fall out of what correlation is. Knowing that lets you read a heatmap critically: the diagonal carries no information (it is always the brightest, "perfect" color), so your eye should go straight to the off-diagonal cells where the real relationships live.

Step 6 — Communicate the finding

Exploration is for you; a final figure is for someone else. The last move of an EDA is to take the single most important thing you found and draw it so a reader gets it in seconds — with a message-style title that states the takeaway, not just the variable names.

Our cleanest finding was that the species separate by bill shape. Here is that scatter, dressed for an audience: a title that says what the chart means, and clear axis labels.
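A minimal sketch of such a figure, assuming a pandas DataFrame with the bill columns and `species`. The helper name `plot_bill_shape_figure`, the exact label wording, and the title offset `y=1.03` are illustrative choices rather than the page's own settings.

```python
import pandas as pd
import seaborn as sns


def plot_bill_shape_figure(df: pd.DataFrame) -> sns.FacetGrid:
    """Scatter of bill length vs depth with a message-style title and plain labels."""
    g = sns.relplot(
        data=df, x="bill_length_mm", y="bill_depth_mm", hue="species"
    )
    # Replace the raw column names with readable labels that include units.
    g.set_axis_labels("Bill length (mm)", "Bill depth (mm)")
    # State the takeaway; y slightly above 1 keeps the title clear of the axes.
    g.figure.suptitle("Penguin species separate cleanly by bill shape", y=1.03)
    return g
```

Saving with `g.savefig("bill_shape.png", bbox_inches="tight")` keeps the raised title inside the exported image; the file name is only an example.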

Compare this to the untitled bill plot you drew in Step 4. The data is identical. What changed is that a reader no longer has to derive the point, because the title hands it to them and the labels read in plain English. That is the difference between an exploratory figure (terse, fast, for yourself) and an explanatory one (titled, labeled, for an audience). Both are legitimate; knowing which you are making is the skill.

Title with the verb, not the nouns

"bill_length_mm vs bill_depth_mm" describes the axes. "Penguin species separate cleanly by bill shape" describes the finding. For a figure meant to persuade or inform, prefer the sentence with a verb in it — it tells the reader what to take away before they finish reading the axes.

Check your understanding

Question (select one)

You open a brand-new dataset you have never seen. Which step makes the most sense first, before drawing any chart?

Fit a regression line to the two columns whose names sound related.

Inspect .shape, .head(), and .isna().sum() to learn the size, the columns, and where values are missing.

Immediately draw a pair plot of every column against every other.

Pick a color palette so your charts look polished from the start.

Question (select one)

You want to compare the body_mass_g distribution across the three penguin species and read each group's median and spread at a glance. Which chart fits the question best?

A scatter plot of body_mass_g against flipper_length_mm.

A single histogram of body_mass_g with no grouping.

A box (or violin) plot with species on one axis and body_mass_g on the other.

A correlation heatmap of the numeric columns.

Question (select one)

On a correlation heatmap, why is it important to use a diverging palette with center=0 (for example cmap="vlag", center=0)?

It makes the figure load faster.

It pins a neutral color to zero so positive and negative correlations read as opposite hues, with strength shown by intensity.

It forces every correlation to fall between 0 and 1.

It hides the diagonal automatically.

You have now run a complete EDA: you sized up the table, examined single variables, compared groups, studied relationships, scanned all pairs at once, and turned your best finding into a figure built to communicate. That loop — glance, question, chart, notice, repeat — is the whole job, and you can now run it on any tidy dataset. The final page steps back to recap the mental models you built and points you toward where to take them next.
