Scatter Plots
The chart for seeing how two numeric variables relate — correlation, clusters, and outliers — plus how hue, style, and size add more dimensions.
A scatter plot puts one numeric variable on the x-axis, another on the y-axis, and draws a single dot for every row. It is the most direct way to answer the question "how do these two measurements relate?" — and it is often the very first plot you should reach for in exploration.
Both axes use position, the encoding the human eye reads most accurately. That is why a single scatter plot can reveal correlation, clusters, gaps, and outliers at a glance.
What a scatter plot needs and shows
- Data it needs: two numeric (continuous) variables — one for x, one for y. One dot per observation.
- What it highlights best: the shape of a relationship — is it increasing or decreasing? linear or curved? tight or loose? Are there separate clusters or stray outliers?
- What it hides: exact counts where dots overlap, and any variable you did not map. It shows individuals, not summaries.
- When it breaks: with tens of thousands of points the dots pile up into a solid blob (overplotting), and with too many color groups the legend becomes a guessing game. We'll fix both below.
The minimum
relplot ("relational plot") is Seaborn's figure-level entry point for
scatter and line charts. With kind="scatter" (the default) you get a
scatter plot:
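A minimal sketch of that call, assuming the penguins example dataset is loaded with sns.load_dataset (which downloads and caches it on first use) and using the bill columns named in the exercise further down this page:

```python
import seaborn as sns

# Load the example penguins data (downloaded and cached on first use).
penguins = sns.load_dataset("penguins")

# kind="scatter" is the default; it is spelled out here only for clarity.
g = sns.relplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    kind="scatter",
)
```

In a notebook the figure appears on its own; in a plain script you would typically call matplotlib's plt.show() afterwards.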
That cloud already tells a story: bill length and depth are loosely related, and there seem to be two or three blobs. But what are those blobs? The plot can't say — until we add another variable.
A third variable with hue
Mapping a column to hue colors each dot by its group. This is where
scatter plots become genuinely powerful: a hidden grouping snaps into view.
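As a sketch, the same penguins plot with species mapped to hue might look like this (same dataset and column assumptions as before):

```python
import seaborn as sns

penguins = sns.load_dataset("penguins")

# hue="species" colors each dot by its species and adds a category legend.
g = sns.relplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="species",
)
```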
The mystery blobs were species all along. Within each species the two measurements are positively related, even though the overall cloud looked shapeless. (This reversal — a trend inside groups that vanishes or flips when you pool them — is worth remembering; it's the visual face of Simpson's paradox.)
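One rough way to check this numerically is to compare the pooled Pearson correlation with the per-species ones. The sketch below assumes the same penguins columns; the names pooled_r and within_r are illustrative only.

```python
import seaborn as sns

penguins = sns.load_dataset("penguins")

# Correlation over all penguins pooled together (missing values are skipped).
pooled_r = penguins["bill_length_mm"].corr(penguins["bill_depth_mm"])

# The same correlation computed separately inside each species.
within_r = {
    species: group["bill_length_mm"].corr(group["bill_depth_mm"])
    for species, group in penguins.groupby("species")
}

print(f"pooled: {pooled_r:.2f}")
for species, r in within_r.items():
    print(f"{species}: {r:.2f}")
```

If the within-species values come out positive while the pooled value is weaker or of the opposite sign, the numbers agree with what the colored plot shows.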
hue can be categorical or numeric
If you map hue to a categorical column (like species), Seaborn picks
distinct colors and builds a category legend. If you map it to a numeric
column, Seaborn uses a continuous color gradient and, by default, a legend
that samples a few representative values rather than a colorbar. The same
parameter, two behaviors, chosen by the column's type.
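To see the numeric behavior, here is a sketch that maps hue to body_mass_g. Picking that column is an assumption made for illustration; any numeric column would do.

```python
import seaborn as sns

penguins = sns.load_dataset("penguins")

# A numeric hue column switches Seaborn to a sequential color gradient.
# The legend lists a few sample values along that gradient.
g = sns.relplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="body_mass_g",
)
```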
Even more dimensions: style and size
You can push extra variables onto a scatter plot with style (marker
shape, for a categorical variable) and size (marker area, for a
numeric one):
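A sketch of such a plot follows. The column choices (sex for style, body_mass_g for size) and the sizes range are assumptions made for illustration, not necessarily the ones used in the original example.

```python
import seaborn as sns

penguins = sns.load_dataset("penguins")

# Four variables: x and y by position, sex by marker shape,
# and body mass by marker size.
g = sns.relplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    style="sex",
    size="body_mass_g",
    sizes=(20, 200),  # smallest and largest marker sizes
)
```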
Four variables on one chart. Impressive — but pause. Can you actually read the body mass of a single dot from its size? Or tell a circle from a cross at a glance? Usually not. Each extra channel costs the viewer effort.
More channels is not more insight
Position (x, y) is read precisely. Color (hue) is read well for a handful
of categories. Shape (style) and size are read poorly. Reserve them
for a secondary variable you only need roughly, and resist the urge to map
something to every channel just because you can. A clear three-variable plot
beats a cluttered five-variable one.
Overplotting: when the dots pile up
A scatter plot's weakness is density. With thousands of overlapping points
the busy regions all render as the same solid color, and you lose exactly
the structure you came to see. The first and easiest fix is alpha
(opacity): semi-transparent dots make dense regions visibly darker.
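Here is a sketch of the effect using synthetic data. The 20,000 random points and the column names x and y are made up for illustration and stand in for any large dataset.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic, deliberately dense data: 20,000 correlated points.
rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
dense = pd.DataFrame({"x": x, "y": 0.5 * x + rng.normal(size=20_000)})

# alpha=0.1 makes each dot 10% opaque, so overlapping dots add up
# and dense regions look darker than sparse ones.
g = sns.relplot(data=dense, x="x", y="y", alpha=0.1, s=10, linewidth=0)
```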
Lower the alpha and the cloud reveals where points actually concentrate.
Try editing it to alpha=1.0 and back to feel the difference. For truly
dense data, you would switch encodings entirely — to a 2-D histogram or a
density plot, which we'll meet in the distributions chapters.
relplot vs scatterplot
You'll see two ways to draw the same scatter:
- sns.relplot(..., kind="scatter"): figure-level. Returns a grid, manages its own figure, and can split into panels with col/row.
- sns.scatterplot(...): axes-level. Draws onto a single matplotlib Axes you can combine with other plots.
They produce the same dots; they differ in what they return and how they
compose. We devote a whole page to that distinction — for now, reach for
relplot when you might want facets, and scatterplot when you're placing
one chart on an Axes you control.
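The two styles side by side, as a sketch on the penguins data (faceting by species here is just one illustrative choice):

```python
import matplotlib.pyplot as plt
import seaborn as sns

penguins = sns.load_dataset("penguins")

# Figure-level: relplot creates its own figure and returns a FacetGrid.
# col="species" splits the scatter into one panel per species.
g = sns.relplot(
    data=penguins, x="bill_length_mm", y="bill_depth_mm", col="species"
)

# Axes-level: scatterplot draws onto an Axes you created yourself
# and returns that same Axes.
fig, ax = plt.subplots()
returned = sns.scatterplot(
    data=penguins, x="bill_length_mm", y="bill_depth_mm", ax=ax
)
```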
Your turn
Using the penguins dataset, draw a scatter plot with sns.relplot:
- bill_length_mm on the x-axis,
- bill_depth_mm on the y-axis,
- colored by species (use hue).
Assign the result to a variable named g.
Check your understanding
What is a scatter plot fundamentally best at revealing?
- The composition of a whole into parts.
- A single variable's frequency distribution.
- The relationship between two numeric variables: its direction, shape, clustering, and outliers.
- A ranking of categories from largest to smallest.
Your scatter plot of 40,000 points looks like one solid dark blob. Which fix most directly addresses the problem?
- Increase the marker size so the points are easier to see.
- Lower the opacity with alpha (e.g. alpha=0.2), or switch to a density-style plot.
- Remove the axis gridlines.
- Sort the data before plotting.
You map hue, style, AND size to three different variables on one scatter plot. What is the main risk?
- Seaborn will raise an error because only one extra channel is allowed.
- The plot will be statistically incorrect.
- Shape and size are read imprecisely, so the extra channels add clutter faster than insight.
- The legend will be hidden automatically.
You can now see how two variables relate and layer in a third. Next we keep the relational family going with line plots — what changes when the x-axis represents an ordered progression like time.