Pair Plots
sns.pairplot — a grid of every numeric variable against every other, the fastest first look at a dataset.
When you meet a new dataset, the first question is rarely about one pair of
columns — it's "how does everything relate to everything?". Drawing a
separate scatter plot for each pair of numeric columns by hand is tedious,
and you'd still have to line them up to compare. sns.pairplot does the
whole grid for you in a single call.
A pair plot is a scatterplot matrix: every numeric column is plotted against every other numeric column, laid out in a square grid. It is the fastest way to go from "I just loaded this table" to "I can see its structure."
What a pair plot draws
Read the grid like a multiplication table. For columns A, B, C, the cell in row A, column B is a scatter plot of A (y-axis) against B (x-axis). So:
- Off-diagonal cells are bivariate scatter plots — one numeric column against another, exactly like the scatter plots you've already seen.
- Diagonal cells would be a variable against itself (a useless straight line), so Seaborn replaces them with that variable's univariate distribution — a histogram by default.
That single layout lets you scan every pairwise relationship and every variable's spread at once.
Penguins has four numeric columns, so you get a 4×4 grid: sixteen panels,
twelve of which are scatter plots and four of which (the diagonal) are
histograms. Already you can spot that body_mass_g and flipper_length_mm
rise together tightly, while bill_depth_mm looks like it might split into
clumps.
It only uses the numeric columns
By default pairplot quietly ignores non-numeric columns like species,
island, and sex — there's no meaningful scatter axis for a category. You
can bring a categorical column back in, but as color, not as an axis.
That's what hue is for, next.
The headline move: color by group with hue
A bare pair plot shows the data as one undifferentiated cloud. The single
most useful thing you can do — the move you'll reach for on almost every new
dataset — is map a categorical column to hue. Two things change at
once:
- Every scatter point is colored by its group, so clusters that belong to different categories separate visually.
- The diagonal switches from one pooled histogram to one distribution per group, overlaid — so you see how each group is spread on each variable.
Look what fell out for free. In the bill_length_mm vs bill_depth_mm
panel the three species form three tidy clusters; on the diagonal you can
see that Gentoo penguins are clearly heavier and have longer flippers than
the other two. You did not compute a single group statistic — you assigned
one column to hue and the separability of the groups revealed itself.
This is exploratory data analysis at its most efficient.
In a pairplot, what is drawn on the diagonal of the grid?
A bivariate scatter of two different numeric columns.
Each variable's own univariate distribution (a histogram by default).
A correlation coefficient printed as a number.
Nothing — the diagonal cells are left blank.
Controlling size and clutter
A pair plot's great weakness is that it grows as the square of the number of numeric columns. With four columns you get sixteen panels; with ten columns you'd get one hundred tiny, unreadable ones. Three parameters keep it under control.
vars=[...] restricts the grid to a chosen subset of columns — the
single most important lever for readability. Pick the handful you actually
care about:
corner=True drops the upper triangle. The grid is symmetric — the
panel for A vs B shows the same relationship as B vs A, just with
the axes swapped — so the upper half is redundant. Hiding it nearly halves
the ink and the clutter:
diag_kind controls the diagonal: "hist" for histograms or "kde"
for smooth density curves. The default, "auto", picks histograms when there
is no hue and KDE curves when hue is set, because overlaid curves collide
less than several overlaid histograms. Set it explicitly to override that choice:
When a pair plot becomes unreadable
Two failure modes, two fixes:
- Too many variables. The panel count is n², so ten numeric columns means a hundred postage-stamp plots. Use vars= to pick the few that matter, and corner=True to drop the redundant half.
- Too many points. Every one of those panels is a scatter, so a large dataset overplots in all of them at once. Take a sample (e.g. df.sample(n=2000, random_state=0)) before plotting, or set diag_kind="kde" and switch the off-diagonals to density (see PairGrid below).
You call pairplot on a DataFrame with 12 numeric columns and the
result is an unreadable wall of tiny panels. What is the most direct fix?
Increase the figure height so each panel is bigger.
Pass vars=[...] with the handful of columns you actually care about.
Map every column to hue at once.
Set kind="line".
Under the hood: PairGrid
pairplot is a convenient wrapper around a lower-level engine called
PairGrid. PairGrid sets up the same square grid of axes but draws
nothing until you tell it what to put where. You map a plotting function
onto the diagonal and onto the off-diagonal cells yourself:
That reproduces a basic pair plot, but the power is that you can map any
function — a KDE on the lower triangle, a scatter on the upper, a histogram
on the diagonal — for full control. Reach for pairplot when you want a
fast, sensible default (which is most of the time), and drop down to
PairGrid only when you need to customize what each region shows.
What a pair plot shows — and what it doesn't
- Data it needs: several numeric columns, plus an optional categorical column for hue.
- What it highlights best: all pairwise relationships at once, how groups separate (with hue), and each variable's distribution (on the diagonal). It is the ideal first sweep of a new dataset.
- What it hides: anything that isn't a pairwise, two-variable view. A three-way interaction won't show up, and with many points the individual dots blur together in every panel.
- When it breaks: many columns (n² tiny panels; subset with vars=) or many rows (every panel overplots; sample, or use corner/kde).
Your turn
Using the penguins dataset, build a pair plot with sns.pairplot
that:
- is colored by species (use hue), and
- includes only these three columns, via vars: bill_length_mm, flipper_length_mm, and body_mass_g.
Assign the result to a variable named g. (Restricting to three columns
keeps it a tidy 3×3 grid.)
Check your understanding
What does a pairplot fundamentally draw?
A single scatter plot of the two most correlated columns.
A scatterplot matrix — every numeric column against every other — with each variable's distribution on the diagonal.
A correlation heatmap of the numeric columns.
A bar chart of each column's mean.
Adding hue="species" to sns.pairplot(penguins, ...) changes the plot
how?
It removes the diagonal distributions to make room for a legend.
It restricts the grid to only the species column.
It colors every scatter point by its group and splits each diagonal into one distribution per group.
It converts the scatter panels into line plots.
What is the practical difference between pairplot and PairGrid?
They are identical; PairGrid is just an alias.
PairGrid only works for two columns at a time.
pairplot is a high-level wrapper with sensible defaults; PairGrid is the lower-level engine where you map your own functions onto the diagonal and off-diagonal cells.
pairplot cannot show hue, but PairGrid can.
A pair plot answers "how does everything relate?" in one glance. Next, we'll zoom from the whole grid down to a single, richly annotated pair of variables with the joint plot — a center plot flanked by each variable's marginal distribution.