Tidy Data for Seaborn
Why Seaborn wants one variable per column and one observation per row — and how to reshape your data so plotting becomes trivial.
Before you draw a single chart, you have to get the shape of your data right. Almost every frustrating moment with Seaborn — "why won't it color by group?", "why is my x-axis a mess?" — traces back to a table that is shaped the wrong way.
Seaborn is built for tidy data. Once your table is tidy, plotting feels effortless: you just name columns and the roles they should play. When it is not tidy, you fight the library on every line.
This page is foundational. Take your time with it; it pays off in every later chapter.
What "tidy" means
A table is tidy when three rules hold:
- Each variable is its own column.
- Each observation is its own row.
- Each value sits in its own cell.
That's it. A variable is something you measured (temperature, city, species). An observation is one thing you measured it on (one penguin, one day at a restaurant, one month in one city).
Tidy = 'long', not 'wide'
Tidy data is often called long format, because adding more groups makes the table longer (more rows), not wider (more columns). The opposite, wide format, spreads one variable across many columns. Humans often prefer wide tables for reading; Seaborn almost always prefers long.
The transformation, in one picture
Here is the move you will make over and over: reshape a wide table into a long/tidy one so Seaborn can map its columns to visual roles.
Let's make that concrete. Imagine three cities' average temperatures across three months.
A wide table — convenient for a human to scan, one column per city:
| month | London | Tokyo | Cairo |
|---|---|---|---|
| Jan | 5 | 8 | 14 |
| Feb | 6 | 9 | 16 |
| Mar | 9 | 12 | 20 |
The same data, tidy/long — one row per (month, city) observation:
| month | city | temp |
|---|---|---|
| Jan | London | 5 |
| Feb | London | 6 |
| Mar | London | 9 |
| Jan | Tokyo | 8 |
| ... | ... | ... |
Notice what happened: the city, which was hidden in the column headers of the wide table, became an honest variable in its own column. That is the whole trick, because parameters such as x, y, and hue take column names, and a header is not a column.
Reshaping wide → long with melt
pandas does the reshaping. DataFrame.melt takes the columns you want to
keep (id_vars) and stacks the rest into two new columns: one holding
the old column names, one holding the values.
Now melt it. Keep month as an identifier; stack London, Tokyo, and
Cairo into a city column and a temp column.
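As a minimal sketch, assuming the wide table shown earlier is held in a DataFrame (the variable names wide and tidy are chosen here for illustration), the melt could look like this:

```python
import pandas as pd

# The wide table from above: one column per city.
wide = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"],
    "London": [5, 6, 9],
    "Tokyo": [8, 9, 12],
    "Cairo": [14, 16, 20],
})

# Keep month as the identifier; stack the three city columns into
# a "city" column (the old headers) and a "temp" column (the old cells).
tidy = wide.melt(id_vars="month", var_name="city", value_name="temp")

print(tidy.shape)   # (9, 3)
print(tidy.head())
```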
Three rows became nine — the table got longer, and city is now
something we can point Seaborn at.
In the melted long table above, what kind of thing is city?
- A column header, exactly as it was before.
- A variable stored in its own column, with one value per row.
- An index that pandas hides from Seaborn.
Why Seaborn loves long data
Here is the reason all of this matters. Seaborn's whole interface is "assign a column to a visual role":
- x="month": column on the x-axis
- y="temp": column on the y-axis
- hue="city": column that decides color
That last one is only possible because city is a column. With the wide
table there was no single column to hand to hue — the cities were three
separate columns, which is exactly the wrong shape.
One short call drew three colored lines with a legend. Try imagining the
matplotlib version: a loop over cities, three plot calls, manual colors,
a hand-built legend. Tidy data is what buys you the brevity.
Most built-in datasets are already tidy
The datasets we use throughout this course — tips, penguins, mpg —
arrive tidy. Real-world data (spreadsheets, exports, pivot tables) often
does not, so melt is one of the most useful tools in your kit.
When wide is actually the right shape
Tidy is the default, but not a law. A few charts genuinely want a matrix
(wide) layout — most importantly the heatmap, where rows and columns
are both meaningful axes and each cell is a value. We'll meet that in the
correlation-heatmaps chapter. The reverse of melt is pivot, which
spreads a long table back into a wide matrix:
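A minimal sketch of that reverse step, assuming the long temperature table from earlier (rebuilt inline here; the name matrix is illustrative):

```python
import pandas as pd

# The long/tidy table: one row per (month, city) observation.
tidy = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar"] * 3,
    "city": ["London"] * 3 + ["Tokyo"] * 3 + ["Cairo"] * 3,
    "temp": [5, 6, 9, 8, 9, 12, 14, 16, 20],
})

# Rows come from month, columns from city, cell values from temp.
matrix = tidy.pivot(index="month", columns="city", values="temp")

# Put the rows in calendar order explicitly rather than relying on
# whatever label order pivot returns.
matrix = matrix.reindex(index=["Jan", "Feb", "Mar"])

print(matrix)
```

Each (month, city) pair must appear only once for pivot to work; with duplicate pairs pandas raises an error, and an aggregating reshape such as pivot_table is the usual alternative.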
So the rule of thumb is: tidy/long for almost everything; wide/matrix only for heatmaps and similar grid views. When in doubt, go long.
Your turn
The DataFrame scores is wide — one column per subject:
| student | math | reading |
|---|---|---|
| Ana | 88 | 90 |
| ... | ... | ... |
Reshape it into a tidy/long DataFrame named tidy with exactly these
three columns, in this order: student, subject, score. Keep
student as the identifier and stack the subject columns.
Common tidy-data mistakes
- Values trapped in column headers. 2019, 2020, 2021 as separate columns means "year" is a variable hiding in the headers; melt it.
- One column holding two variables. A "London_temp" column mixes city and measurement. Split it (e.g. with str.split) before plotting.
- Aggregates pasted into the data. A "Total" row or column is a summary, not an observation. Seaborn can compute totals for you; drop it from the raw table.
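The second mistake can be sketched as follows. The table and its numbers are made up for illustration, and the column names key, value, city, and measure are hypothetical choices, not part of any dataset used in this course. The idea is to melt first, then split the combined header into its two variables.

```python
import pandas as pd

# Made-up example: each header packs two variables, city and measurement.
wide = pd.DataFrame({
    "month": ["Jan", "Feb"],
    "London_temp": [5, 6],
    "London_rain": [55, 41],
    "Cairo_temp": [14, 16],
    "Cairo_rain": [5, 4],
})

# Step 1: stack the combined headers into one "key" column.
long = wide.melt(id_vars="month", var_name="key", value_name="value")

# Step 2: split "London_temp" into "London" and "temp".
parts = long["key"].str.split("_", expand=True)
long["city"] = parts[0]
long["measure"] = parts[1]
long = long.drop(columns="key")

print(long.head())
```

After this, city and measure are ordinary columns, so either one can be handed to hue or col.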
Tidy first, plot second
If a Seaborn call is fighting you — you can't get hue to work, or the
x-axis has the wrong thing on it — stop and look at the table's shape
before touching plot parameters. Nine times out of ten the fix is a melt,
not a new keyword argument.
Check your understanding
Which statement best describes a tidy (long) dataset?
- Every dataset that has no missing values.
- A dataset stored as a wide matrix so it is easy for people to read.
- A dataset where each variable is a column, each observation is a row, and each value is a single cell.
- A dataset that has been sorted alphabetically.
You have monthly sales with a separate column for each region:
month, North, South, East, West. You want a single line chart
with one colored line per region. What should you do first?
- Pass hue=["North", "South", "East", "West"] to relplot.
- melt the four region columns into a region column and a sales column, then map hue="region".
- Plot each region with four separate relplot calls.
- Sort the DataFrame by month and call relplot on the wide table.
Which of these is a legitimate reason to use a wide (matrix) layout instead of tidy/long?
- You want to color a scatter plot by a categorical group.
- You want one panel per category with col=.
- You are drawing a heatmap, where rows and columns are both axes and each cell is a value.
- You want Seaborn to compute group means automatically.
In df.melt(id_vars="date", var_name="metric", value_name="amount"),
what does id_vars control?
- The columns that get stacked into the new long columns.
- The identifier column(s) to keep as-is, repeated alongside each stacked value.
- The new name for the column of values.
- The number of rows in the output.
You now have the single most important prerequisite for everything that follows: data in the right shape. Next, let's look at the two kinds of variables Seaborn cares about — continuous and categorical — and how that distinction drives every chart choice.