
Tidy Data for Seaborn

Why Seaborn wants one variable per column and one observation per row — and how to reshape your data so plotting becomes trivial.

Before you draw a single chart, you have to get the shape of your data right. Almost every frustrating moment with Seaborn — "why won't it color by group?", "why is my x-axis a mess?" — traces back to a table that is shaped the wrong way.

Seaborn is built for tidy data. Once your table is tidy, plotting feels effortless: you just name columns and the roles they should play. When it is not tidy, you fight the library on every line.

This page is foundational. Take your time with it — the payoff is every later chapter.

What "tidy" means

A table is tidy when three rules hold:

  1. Each variable is its own column.
  2. Each observation is its own row.
  3. Each value sits in its own cell.

That's it. A variable is something you measured (temperature, city, species). An observation is one thing you measured it on (one penguin, one day at a restaurant, one month in one city).

Tidy = 'long', not 'wide'

Tidy data is often called long format, because adding more groups makes the table longer (more rows), not wider (more columns). The opposite, wide format, spreads one variable across many columns. Humans often prefer wide tables for reading; Seaborn almost always prefers long.

The transformation, in one picture

Here is the move you will make over and over: reshape a wide table into a long/tidy one so Seaborn can map its columns to visual roles.

Let's make that concrete. Imagine three cities' average temperatures across three months.

A wide table — convenient for a human to scan, one column per city:

month  London  Tokyo  Cairo
Jan    5       8      14
Feb    6       9      16
Mar    9       12     20

The same data, tidy/long — one row per (month, city) observation:

month  city    temp
Jan    London  5
Feb    London  6
Mar    London  9
Jan    Tokyo   8
...    ...     ...

Notice what happened: the city, which was hidden in the column headers of the wide table, became an honest variable in its own column. That is the whole trick, because Seaborn's x, y, and hue parameters each take a column, and a value stored in a header is not in any column.

Reshaping wide → long with melt

pandas does the reshaping. DataFrame.melt takes the columns you want to keep (id_vars) and stacks the rest into two new columns: one holding the old column names, one holding the values.
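
Before melting, the wide table from above has to exist as a DataFrame. The following is a minimal sketch of that starting point; the variable name `wide` is an assumption chosen for illustration.

```python
import pandas as pd

# Wide layout: one row per month, one column per city.
# The cell values are the average temperatures from the table above.
wide = pd.DataFrame(
    {
        "month": ["Jan", "Feb", "Mar"],
        "London": [5, 6, 9],
        "Tokyo": [8, 9, 12],
        "Cairo": [14, 16, 20],
    }
)

print(wide)
```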


Now melt it. Keep month as an identifier; stack London, Tokyo, and Cairo into a city column and a temp column.
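
The melt step described here could look like the sketch below. The names `wide` and `long` are assumptions; `id_vars`, `var_name`, and `value_name` are standard `DataFrame.melt` parameters.

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "month": ["Jan", "Feb", "Mar"],
        "London": [5, 6, 9],
        "Tokyo": [8, 9, 12],
        "Cairo": [14, 16, 20],
    }
)

# id_vars: column(s) kept as identifiers and repeated on every output row.
# var_name: name for the new column that holds the old column headers.
# value_name: name for the new column that holds the cell values.
long = wide.melt(id_vars="month", var_name="city", value_name="temp")

print(long.shape)  # (9, 3)
print(long.head())
```

Every column not listed in `id_vars` is stacked, so the three city columns collapse into the `city` and `temp` pair.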


Three rows became nine — the table got longer, and city is now something we can point Seaborn at.

Question (select one)

In the melted long table above, what kind of thing is city?

A column header, exactly as it was before.

A variable stored in its own column, with one value per row.

An index that pandas hides from Seaborn.

Why Seaborn loves long data

Here is the reason all of this matters. Seaborn's whole interface is "assign a column to a visual role":

  • x="month" — column on the x-axis
  • y="temp" — column on the y-axis
  • hue="city" — column that decides color

That last one is only possible because city is a column. With the wide table there was no single column to hand to hue — the cities were three separate columns, which is exactly the wrong shape.


One short call drew three colored lines with a legend. Try imagining the matplotlib version: a loop over cities, one plot call per city, a label for each line, hand-set axis labels, and a separate legend call. Tidy data is what buys you the brevity.
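
For comparison, here is a rough sketch of what a matplotlib-only version might look like when working from the wide table. It is one possible way to write it, and the variable names are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

wide = pd.DataFrame(
    {
        "month": ["Jan", "Feb", "Mar"],
        "London": [5, 6, 9],
        "Tokyo": [8, 9, 12],
        "Cairo": [14, 16, 20],
    }
)

fig, ax = plt.subplots()

# There is no city column to map, so loop over the city columns by hand.
for city in ["London", "Tokyo", "Cairo"]:
    ax.plot(wide["month"], wide[city], label=city)

# Axis labels and the legend are also set by hand.
ax.set_xlabel("month")
ax.set_ylabel("temp")
ax.legend(title="city")
```

Each new city means another entry in the loop list, whereas the long table only gains rows and the plotting call stays the same.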

Most built-in datasets are already tidy

The datasets we use throughout this course — tips, penguins, mpg — arrive tidy. Real-world data (spreadsheets, exports, pivot tables) often does not, so melt is one of the most useful tools in your kit.

When wide is actually the right shape

Tidy is the default, but not a law. A few charts genuinely want a matrix (wide) layout — most importantly the heatmap, where rows and columns are both meaningful axes and each cell is a value. We'll meet that in the correlation-heatmaps chapter. The reverse of melt is pivot, which spreads a long table back into a wide matrix:
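
A minimal sketch of that reverse step, assuming the long table from earlier is rebuilt here as `long` and the result is called `matrix` (both names are illustrative):

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "month": ["Jan", "Feb", "Mar"],
        "London": [5, 6, 9],
        "Tokyo": [8, 9, 12],
        "Cairo": [14, 16, 20],
    }
)
long = wide.melt(id_vars="month", var_name="city", value_name="temp")

# index -> row labels, columns -> column labels, values -> cell contents.
matrix = long.pivot(index="month", columns="city", values="temp")

# pivot sorts the labels, so restore calendar and original city order.
matrix = matrix.reindex(
    index=["Jan", "Feb", "Mar"],
    columns=["London", "Tokyo", "Cairo"],
)

print(matrix)
```

Note that `pivot` requires each (index, columns) pair to appear only once; here every (month, city) pair is a single observation, so that holds.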


So the rule of thumb is: tidy/long for almost everything; wide/matrix only for heatmaps and similar grid views. When in doubt, go long.

Your turn

Challenge: Tidy up a gradebook

The DataFrame scores is wide — one column per subject:

student  math  reading
Ana      88    90
...      ...   ...

Reshape it into a tidy/long DataFrame named tidy with exactly these three columns, in this order: student, subject, score. Keep student as the identifier and stack the subject columns.

Common tidy-data mistakes

  • Values trapped in column headers. 2019, 2020, 2021 as separate columns means "year" is a variable hiding in the headers — melt it.
  • One column holding two variables. A "London_temp" column mixes city and measurement. Melt first, then split the resulting labels (e.g. with str.split) before plotting.
  • Aggregates pasted into the data. A "Total" row or column is a summary, not an observation. Seaborn can compute totals for you, so drop the summary row or column from the raw table.
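
The fixes above can be combined in one short cleanup. The sketch below uses a hypothetical export with made-up rainfall values, fused "city_measure" headers, and a pasted-in "Total" row; all column and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical messy export (values are made up for this example).
raw = pd.DataFrame(
    {
        "month": ["Jan", "Feb", "Mar", "Total"],
        "London_rain": [55, 41, 42, 138],
        "Tokyo_rain": [52, 56, 118, 226],
    }
)

# 1. Drop the summary row: it is an aggregate, not an observation.
obs = raw[raw["month"] != "Total"]

# 2. Melt so the values trapped in the headers land in a column.
long = obs.melt(id_vars="month", var_name="city_measure", value_name="rain")

# 3. Split the fused label and keep only the city part.
long["city"] = long["city_measure"].str.split("_").str[0]
long = long.drop(columns="city_measure")[["month", "city", "rain"]]

print(long)
```

The same melt pattern applies when the headers are years such as 2019, 2020, 2021: pass the year columns through `melt` and name the new column "year".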

Tidy first, plot second

If a Seaborn call is fighting you — you can't get hue to work, or the x-axis has the wrong thing on it — stop and look at the table's shape before touching plot parameters. Nine times out of ten the fix is a melt, not a new keyword argument.

Check your understanding

Question (select one)

Which statement best describes a tidy (long) dataset?

Every dataset that has no missing values.

A dataset stored as a wide matrix so it is easy for people to read.

A dataset where each variable is a column, each observation is a row, and each value is a single cell.

A dataset that has been sorted alphabetically.

Question (select one)

You have monthly sales with a separate column for each region: month, North, South, East, West. You want a single line chart with one colored line per region. What should you do first?

Pass hue=["North", "South", "East", "West"] to relplot.

melt the four region columns into a region column and a sales column, then map hue="region".

Plot each region with four separate relplot calls.

Sort the DataFrame by month and call relplot on the wide table.

Question (select one)

Which of these is a legitimate reason to use a wide (matrix) layout instead of tidy/long?

You want to color a scatter plot by a categorical group.

You want one panel per category with col=.

You are drawing a heatmap, where rows and columns are both axes and each cell is a value.

You want Seaborn to compute group means automatically.

Question (select one)

In df.melt(id_vars="date", var_name="metric", value_name="amount"), what does id_vars control?

The columns that get stacked into the new long columns.

The identifier column(s) to keep as-is, repeated alongside each stacked value.

The new name for the column of values.

The number of rows in the output.

You now have the single most important prerequisite for everything that follows: data in the right shape. Next, let's look at the two kinds of variables Seaborn cares about — continuous and categorical — and how that distinction drives every chart choice.
