Wide vs Long
The shape of your data shapes the code you write. Long format unlocks Pandas's superpowers; wide format is easier to read.
The exact same information can be laid out two very different ways. Knowing how — and when — to switch between them is one of the most empowering skills in data analysis.
A concrete example
A small survey asking three people about their happiness across three years can be stored two ways:
Wide format
| person | 2022 | 2023 | 2024 |
|---|---|---|---|
| Aiko | 7 | 8 | 8 |
| Bilal | 6 | 7 | 9 |
| Chen | 9 | 8 | 7 |
Long format
| person | year | happiness |
|---|---|---|
| Aiko | 2022 | 7 |
| Aiko | 2023 | 8 |
| Aiko | 2024 | 8 |
| Bilal | 2022 | 6 |
| ... | ... | ... |
Both are "the same data." But the second form — long, or tidy — is what almost every analytical and plotting library prefers.
The tidy data principles
Hadley Wickham's famous tidy data rules:
- Each variable is a column.
- Each observation is a row.
- Each type of observational unit is a table.
In the wide table, 2022, 2023, 2024 are values of a variable (year), but they are sitting in column names. That's the tell-tale sign of "untidy" data.
melt converts wide to long. pivot (or pivot_table) does the reverse.
Melt — wide to long
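A minimal sketch of the wide-to-long step, assuming the survey table above is held in a DataFrame called wide with the years as string column names (the variable names are illustrative, not taken from the original page):

```python
import pandas as pd

# The wide survey table from the example: one column per year.
wide = pd.DataFrame({
    "person": ["Aiko", "Bilal", "Chen"],
    "2022": [7, 6, 9],
    "2023": [8, 7, 8],
    "2024": [8, 9, 7],
})

# id_vars stay as identifier columns; every other column is unpivoted.
# The old column names land in "year", the cell values in "happiness".
long = wide.melt(id_vars="person", var_name="year", value_name="happiness")

# The years arrive as strings (they were column names), so cast them.
long["year"] = long["year"].astype(int)

# Sort so the rows follow the same order as the long table shown earlier.
long = long.sort_values(["person", "year"]).reset_index(drop=True)
print(long)
```

The result has 9 rows (3 people × 3 years) and the three columns person, year, happiness.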
Pivot — long to wide
pivot requires the (index, columns) combinations to be unique. If you have duplicates that need aggregation, reach for pivot_table (next page).
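As a rough sketch of the reverse direction, the snippet below rebuilds the wide table from the long one and then shows what happens when an (index, columns) pair is duplicated. The DataFrame names and the artificially duplicated row are assumptions made for illustration:

```python
import pandas as pd

# Long form of the survey data (values taken from the wide table above).
long = pd.DataFrame({
    "person": ["Aiko"] * 3 + ["Bilal"] * 3 + ["Chen"] * 3,
    "year": [2022, 2023, 2024] * 3,
    "happiness": [7, 8, 8, 6, 7, 9, 9, 8, 7],
})

# One row per person, one column per year.
wide = long.pivot(index="person", columns="year", values="happiness")

# Optional tidy-up: turn "person" back into a column and drop the
# "year" label that pivot leaves on the column index.
wide = wide.reset_index()
wide.columns.name = None
print(wide)

# Illustrative duplicate: repeat the first row so that
# (Aiko, 2022) appears twice. pivot cannot decide which value to keep.
dup = pd.concat([long, long.iloc[[0]]], ignore_index=True)
try:
    dup.pivot(index="person", columns="year", values="happiness")
    raised = False
except ValueError as err:
    raised = True
    print("pivot failed:", err)
```

The failure in the second half is the case where pivot_table, with an aggregation function, is the appropriate tool.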
Why long is usually better for analysis
Long format plays better with:
- GroupBy: group by year and average happiness with long.groupby("year")["happiness"].mean().
- Plotting libraries (Plotly, Seaborn): pass x="year", y="happiness", color="person" (Seaborn names the last argument hue).
- Joins with other long datasets.
With wide data, year is "trapped" in column names. You'd have to extract it manually before you could do anything time-aware.
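To make the contrast concrete, here is a small sketch under the assumption that the survey data is already in a long DataFrame named long. Because year is an ordinary column, both the group-by and a time-aware filter are one-liners:

```python
import pandas as pd

# Long form of the survey data (values taken from the wide table above).
long = pd.DataFrame({
    "person": ["Aiko"] * 3 + ["Bilal"] * 3 + ["Chen"] * 3,
    "year": [2022, 2023, 2024] * 3,
    "happiness": [7, 8, 8, 6, 7, 9, 9, 8, 7],
})

# Average happiness per year: year is a real column, so groupby just works.
mean_by_year = long.groupby("year")["happiness"].mean()
print(mean_by_year)

# A time-aware filter is an ordinary comparison on the year column.
recent = long[long["year"] >= 2023]
print(recent)
```

With the wide table, the same filter would mean selecting columns by name and converting those names to numbers first.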
Why wide is sometimes better
Wide format is easier for humans to read. It's the format of spreadsheets, of reports, of dashboards.
A common rhythm:
- Store the canonical data in long form.
- Compute and analyse in long form.
- Pivot to wide for the final presentation.
A practical melt with stub columns
This is the typical post-melt cleanup: get the variable out of the column name, then enrich it (parse, sort, type-cast).
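One way this cleanup might look, assuming a hypothetical wide table whose columns carry a "happiness_" stub followed by the year (these column names are invented for illustration and do not come from the original page):

```python
import pandas as pd

# Hypothetical wide table: each value column is "<stub>_<year>".
wide = pd.DataFrame({
    "person": ["Aiko", "Bilal", "Chen"],
    "happiness_2022": [7, 6, 9],
    "happiness_2023": [8, 7, 8],
    "happiness_2024": [8, 9, 7],
})

# Step 1: melt. The "year" column still holds the full old column names,
# e.g. "happiness_2022".
long = wide.melt(id_vars="person", var_name="year", value_name="happiness")

# Step 2: get the variable out of the column name by stripping the stub,
# then type-cast so the year behaves like a number.
long["year"] = (
    long["year"]
    .str.replace("happiness_", "", regex=False)
    .astype(int)
)

# Step 3: sort into a predictable order.
long = long.sort_values(["person", "year"]).reset_index(drop=True)
print(long)
```

pandas also provides pd.wide_to_long for this stub-plus-suffix pattern; the melt-then-clean route is shown here because it makes each step visible.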
Mini challenge
Given the wide DataFrame temps (city × month), produce a long DataFrame called long with these exact columns:
- city (string)
- month (string: "jan", "feb", "mar")
- temp (numeric)
It should have 9 rows (3 cities × 3 months).
Check your understanding
In the wide table person | 2022 | 2023 | 2024, why is "year" considered a hidden variable?
It is not
Years are integers
The years are sitting in column names, not in a column of values — to do anything year-aware you must first lift them out
The data is corrupted
Which operation reshapes wide → long?
pivot
merge
melt
concat
Why do plotting libraries usually prefer long format?
They reject wide DataFrames
It uses less memory
They map columns to plot aesthetics (x, y, color, facet) — a single tidy column for each variable plugs straight in
Wide format is deprecated