Seaborn Foundations: Visualizing Statistical Data

Scatter Plots

The chart for seeing how two numeric variables relate — correlation, clusters, and outliers — plus how hue, style, and size add more dimensions.

A scatter plot puts one numeric variable on the x-axis, another on the y-axis, and draws a single dot for every row. It is the most direct way to answer the question "how do these two measurements relate?" — and it is often the very first plot you should reach for in exploration.

Both axes use position, which is the encoding the human eye reads most accurately. That is why a scatter plot can reveal correlation, clusters, gaps, and outliers in a single glance.

What a scatter plot needs and shows

Data it needs: two numeric (continuous) variables — one for x, one for y. One dot per observation.
What it highlights best: the shape of a relationship — is it increasing or decreasing? linear or curved? tight or loose? Are there separate clusters or stray outliers?
What it hides: exact counts where dots overlap, and any variable you did not map. It shows individuals, not summaries.
When it breaks: with tens of thousands of points the dots pile up into a solid blob (overplotting), and with too many color groups the legend becomes a guessing game. We'll fix both below.

The minimum

relplot ("relational plot") is Seaborn's figure-level entry point for scatter and line charts. With kind="scatter" (the default) you get a scatter plot:
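A minimal sketch of that call. It assumes a synthetic stand-in DataFrame (made-up values, not the real measurements) that reuses the penguins column names so the snippet runs offline; with the tutorial's dataset you would load `sns.load_dataset("penguins")` instead, which downloads the data on first use.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs as a script; omit in a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# ASSUMPTION: synthetic stand-in with made-up values.
# Real data: penguins = sns.load_dataset("penguins")
rng = np.random.default_rng(0)
penguins = pd.DataFrame({
    "bill_length_mm": rng.normal(44, 5, 200),
    "bill_depth_mm": rng.normal(17, 2, 200),
})

# kind="scatter" is the default; it is spelled out here for clarity.
g = sns.relplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    kind="scatter",
)
```

The stand-in is a single shapeless cloud, so it will not show the blobs described next; those come from the real dataset.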

That cloud already tells a story: bill length and depth are loosely related, and there seem to be two or three blobs. But what are those blobs? The plot can't say — until we add another variable.

A third variable with `hue`

Mapping a column to hue colors each dot by its group. This is where scatter plots become genuinely powerful: a hidden grouping snaps into view.
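A minimal sketch of the `hue` mapping. The three-cluster DataFrame below is a synthetic stand-in with made-up cluster centers (an assumption for illustration, not the real penguins measurements); swap in `sns.load_dataset("penguins")` to reproduce the tutorial's plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# ASSUMPTION: made-up (length, depth) centers, one cluster per species.
rng = np.random.default_rng(0)
centers = {"Adelie": (39, 18), "Chinstrap": (49, 18.5), "Gentoo": (47, 15)}
penguins = pd.concat(
    [
        pd.DataFrame({
            "species": name,
            "bill_length_mm": rng.normal(length, 2.5, 50),
            "bill_depth_mm": rng.normal(depth, 1.0, 50),
        })
        for name, (length, depth) in centers.items()
    ],
    ignore_index=True,
)

# hue="species" gives each species its own color and adds a legend.
g = sns.relplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="species",
)
```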

The mystery blobs were species all along. Within each species the two measurements are positively related, even though the overall cloud looked shapeless. (This reversal — a trend inside groups that vanishes or flips when you pool them — is worth remembering; it's the visual face of Simpson's paradox.)
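One way to check this numerically is to compare the pooled correlation with the per-group correlations. The sketch below does that on synthetic data that was deliberately built so each group trends upward while the group centers do not (an assumption for illustration; the numbers are made up and are not the real penguins values).

```python
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic groups with a built-in positive within-group slope.
rng = np.random.default_rng(0)
centers = {"Adelie": (39, 18.3), "Chinstrap": (49, 18.4), "Gentoo": (47.5, 15)}
frames = []
for name, (length, depth) in centers.items():
    dx = rng.normal(0, 2.5, 60)               # spread in bill length
    dy = 0.3 * dx + rng.normal(0, 0.6, 60)    # depth rises with length inside a group
    frames.append(pd.DataFrame({
        "species": name,
        "bill_length_mm": length + dx,
        "bill_depth_mm": depth + dy,
    }))
penguins = pd.concat(frames, ignore_index=True)

# Correlation over all rows, ignoring species.
pooled = penguins["bill_length_mm"].corr(penguins["bill_depth_mm"])

# Correlation computed separately inside each species.
within = {
    name: grp["bill_length_mm"].corr(grp["bill_depth_mm"])
    for name, grp in penguins.groupby("species")
}

print(f"pooled: {pooled:.2f}")
for name, r in within.items():
    print(f"{name}: {r:.2f}")
```

If the pooled value comes out much weaker than every per-group value, or with the opposite sign, pooling is hiding the within-group trend.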

hue can be categorical or numeric

If you map hue to a categorical column (like species), Seaborn picks distinct colors and builds a category legend. If you map it to a numeric column, Seaborn uses a continuous color gradient and, by default, builds a legend from a few evenly spaced sample values rather than listing every value; it does not add a colorbar on its own. The same parameter, two behaviors, chosen by the column's type.
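A short sketch of the numeric case, printing the legend labels so you can see how Seaborn summarizes the numeric range. The data is a synthetic stand-in with made-up values (an assumption); `body_mass_g` mirrors a column name from the penguins dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# ASSUMPTION: synthetic stand-in with made-up values.
rng = np.random.default_rng(0)
penguins = pd.DataFrame({
    "bill_length_mm": rng.normal(44, 5, 200),
    "bill_depth_mm": rng.normal(17, 2, 200),
    "body_mass_g": rng.normal(4200, 800, 200),
})

# A numeric hue column is drawn with a sequential color gradient.
g = sns.relplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="body_mass_g",
)

# Inspect the labels Seaborn chose for the legend.
print([t.get_text() for t in g.legend.get_texts()])
```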

Even more dimensions: `style` and `size`

You can push extra variables onto a scatter plot with style (marker shape, for a categorical variable) and size (marker area, for a numeric one):
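A sketch of all three extra channels on one call. The DataFrame is a synthetic stand-in with made-up values (an assumption) that borrows the penguins column names `species`, `sex`, and `body_mass_g`; use `sns.load_dataset("penguins")` for the real thing.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# ASSUMPTION: made-up (length, depth, mass) centers per species.
rng = np.random.default_rng(0)
centers = {
    "Adelie": (39, 18, 3700),
    "Chinstrap": (49, 18.5, 3700),
    "Gentoo": (47, 15, 5000),
}
penguins = pd.concat(
    [
        pd.DataFrame({
            "species": name,
            "sex": rng.choice(["Male", "Female"], 40),
            "bill_length_mm": rng.normal(length, 2.5, 40),
            "bill_depth_mm": rng.normal(depth, 1.0, 40),
            "body_mass_g": rng.normal(mass, 400, 40),
        })
        for name, (length, depth, mass) in centers.items()
    ],
    ignore_index=True,
)

g = sns.relplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="species",        # color: categorical
    style="sex",          # marker shape: categorical
    size="body_mass_g",   # marker size: numeric
)
```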

Four variables on one chart. Impressive — but pause. Can you actually read the body mass of a single dot from its size? Or tell a circle from a cross at a glance? Usually not. Each extra channel costs the viewer effort.

More channels is not more insight

Position (x, y) is read precisely. Color (hue) is read well for a handful of categories. Shape (style) and size are read poorly. Reserve them for a secondary variable you only need roughly, and resist the urge to map something to every channel just because you can. A clear three-variable plot beats a cluttered five-variable one.

Overplotting: when the dots pile up

A scatter plot's weakness is density. With thousands of overlapping points the busy regions all render as the same solid color, and you lose exactly the structure you came to see. The first and easiest fix is alpha (opacity): semi-transparent dots make dense regions visibly darker.
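A minimal sketch of the alpha fix, assuming a synthetic cloud of 20,000 random points (made-up data with hypothetical column names `x` and `y`, chosen only to force heavy overlap).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# ASSUMPTION: synthetic dense cloud, not a dataset from the tutorial.
rng = np.random.default_rng(0)
dense = pd.DataFrame({
    "x": rng.normal(0, 1, 20_000),
    "y": rng.normal(0, 1, 20_000),
})

# alpha=0.1 makes each dot 10% opaque, so overlapping dots stack into darker areas.
g = sns.relplot(data=dense, x="x", y="y", alpha=0.1)
```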

Lower the alpha and the cloud reveals where points actually concentrate. Try editing it to alpha=1.0 and back to feel the difference. For truly dense data, you would switch encodings entirely — to a 2-D histogram or a density plot, which we'll meet in the distributions chapters.

`relplot` vs `scatterplot`

You'll see two ways to draw the same scatter:

sns.relplot(..., kind="scatter") — figure-level. Returns a grid, manages its own figure, and can split into panels with col/row.
sns.scatterplot(...) — axes-level. Draws onto a single matplotlib Axes you can combine with other plots.

They produce the same dots; they differ in what they return and how they compose. We devote a whole page to that distinction — for now, reach for relplot when you might want facets, and scatterplot when you're placing one chart on an Axes you control.
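The difference in return values can be seen directly. This sketch assumes a synthetic stand-in DataFrame (made-up values with the penguins column names); the Seaborn and matplotlib calls themselves are the standard ones.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit in a notebook

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# ASSUMPTION: synthetic stand-in with made-up values.
rng = np.random.default_rng(0)
penguins = pd.DataFrame({
    "species": np.repeat(["Adelie", "Chinstrap", "Gentoo"], 30),
    "bill_length_mm": rng.normal(44, 5, 90),
    "bill_depth_mm": rng.normal(17, 2, 90),
})

# Figure-level: creates its own figure and can facet with col/row.
g = sns.relplot(
    data=penguins, x="bill_length_mm", y="bill_depth_mm", col="species"
)
print(type(g).__name__)  # FacetGrid
print(g.axes.shape)      # (1, 3)

# Axes-level: draws onto an Axes you created and returns that same Axes.
fig, ax = plt.subplots()
returned = sns.scatterplot(
    data=penguins, x="bill_length_mm", y="bill_depth_mm", ax=ax
)
print(returned is ax)    # True
```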

Your turn

Challenge

Color a scatter by group

Using the penguins dataset, draw a scatter plot with sns.relplot:

bill_length_mm on the x-axis,
bill_depth_mm on the y-axis,
colored by species (use hue).

Assign the result to a variable named g.

Check your understanding

Question (select one)

What is a scatter plot fundamentally best at revealing?

The composition of a whole into parts.

A single variable's frequency distribution.

The relationship between two numeric variables — its direction, shape, clustering, and outliers.

A ranking of categories from largest to smallest.

Question (select one)

Your scatter plot of 40,000 points looks like one solid dark blob. Which fix most directly addresses the problem?

Increase the marker size so the points are easier to see.

Lower the opacity with alpha (e.g. alpha=0.2), or switch to a density-style plot.

Remove the axis gridlines.

Sort the data before plotting.

Question (select one)

You map hue, style, AND size to three different variables on one scatter plot. What is the main risk?

Seaborn will raise an error because only one extra channel is allowed.

The plot will be statistically incorrect.

Shape and size are read imprecisely, so the extra channels add clutter faster than insight.

The legend will be hidden automatically.

You can now see how two variables relate and layer in a third. Next we keep the relational family going with line plots — what changes when the x-axis represents an ordered progression like time.
