Scatter Plots
The chart for seeing the relationship between two variables — and many extensions of it
A scatter plot maps two quantitative variables to the x and y axes and draws one dot per row. On the perceptual ranking we met earlier, it is one of the most effective charts there is: both variables use the most accurate visual encoding (position), so the eye can read correlation, clusters, outliers, and shape at a glance.
If you forget every other chart, remember the scatter plot.
When to use a scatter plot
- You want to see the relationship between two quantitative variables: is it correlated? non-linear? clustered?
- You want to spot outliers — points far from everyone else.
- You want to compare individuals (one dot per row), not aggregates.
- You'll later add a third or fourth variable through color or size (encoding size turns it into a bubble chart, which gets its own page later).
The minimum
The picture answers "are these two measurements related?" in about half a second.
Adding categorical color
Color reveals groups. The classic iris example becomes much richer once we color by species:
Suddenly the structure becomes obvious: the three species occupy different regions of the chart. That's the magic of adding a third variable.
Adding size for a fourth variable
A scatter where the size of each dot encodes another quantitative variable is sometimes called a bubble chart (we'll dedicate a whole page to it later):
Four variables on one chart — and still readable.
Trendlines
When you want to summarize a relationship, add trendline="ols" (ordinary least squares) or trendline="lowess" (a smoothed local regression); both require the statsmodels package to be installed:
Trendlines turn a noisy cloud into a clear direction. Use them when the question is "is there a relationship?" and the cloud is busy.
Overplotting: when there are too many points
Scatter plots can break down when you have many thousands of points — they pile on top of each other and the dense regions all look equally black. Two fixes:
Opacity
Each dot is now semi-transparent, so dense regions become visibly darker.
Density heatmap
For really dense data, switch to a density heatmap (which we'll meet in its own chapter), where the chart shows how many points are in each region of the plane.
A typical scatter-plot mistake
Drawing a scatter plot with one categorical axis is usually not what you want: it produces tall stacks of dots above each category label. Use a strip plot (px.strip), violin plot (px.violin), or box plot (px.box) instead; we'll cover boxes next.
Check your understanding
What is a scatter plot best at showing?
A part-of-a-whole breakdown.
A trend across regular time intervals.
The relationship between two quantitative variables — correlation, clusters, and outliers.
The composition of categorical groups.
You have a scatter plot of 50,000 data points and everything looks like a black blob. What is a sensible fix?
Switch to a pie chart.
Increase the marker size.
Lower the marker opacity (e.g., opacity=0.3) or switch to a 2-D density heatmap.
Hide the axis labels.
What is a trendline on a scatter plot useful for?
It changes the underlying data.
It applies a database join.
It summarizes the central tendency of the relationship between x and y, making it easier to see direction (positive, negative, none) and to compare it across groups.
It removes outliers.