
Tidy Data Principles

The most important conceptual idea in modern data analysis — a simple, three-rule recipe for shaping data so that every tool just works.

In 2014, Hadley Wickham wrote a paper that quietly changed how a whole generation of analysts works. It introduced a deceptively simple idea called tidy data.

A dataset is tidy when:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

That's it. Three rules. They sound obvious. They are not — most real-world data violates them.

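The payoff: once your data is tidy, most modern R tools work with it directly. dplyr, ggplot2, and most statistical modeling functions expect tidy input, and tidyr exists to get data into that shape. It is often said that cleaning and preparing data takes about 80% of the effort in an analysis, and getting the data tidy is a large part of that work.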

A tidy example


Every column is a variable (city, year, population). Every row is an observation (a city in a year). This is the form everything downstream wants.
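
A minimal sketch of a data frame in this shape. The city names and population figures are invented for illustration; only the variables city, year, and population (named pop here, as in the plotting example later on this page) come from the text.

```r
# Illustrative values only: the cities and populations are invented.
tidy <- data.frame(
  city = c("Avalon", "Avalon", "Brook", "Brook"),
  year = c(2022, 2023, 2022, 2023),
  pop  = c(1000, 1100, 2000, 2050)
)

# One row per (city, year) observation, one column per variable.
print(tidy)
str(tidy)
```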

What "messy" looks like

Here is the same data, but wide — years spread across columns:
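
A minimal sketch of such a wide table, again with invented cities and values. The column names y2022 and y2023 follow the discussion on this page; everything else is an assumption made for illustration.

```r
# Illustrative values only: the cities and populations are invented.
wide <- data.frame(
  city  = c("Avalon", "Brook"),
  y2022 = c(1000, 2000),
  y2023 = c(1100, 2050)
)

print(wide)

# There is no `year` column to filter or group on.
has_year <- "year" %in% names(wide)
print(has_year)  # FALSE
```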


What's wrong with this? Two things:

  1. "Year" is a variable, but it's hidden in the column names y2022, y2023. To compute the average population for each year, you can't group by year, because there is no year column.
  2. The same variable (population) lives in multiple columns. This makes vectorized operations awkward and forces you to reshape before most plots.

Wide data is fine for display (humans like compact tables). But for computation, you almost always want it tidy / long.

The five messy patterns

The original "Tidy Data" paper identified five common ways data is messy:

| Pattern | Example |
|---|---|
| Column headers are values, not variable names | y2022, y2023 columns |
| Multiple variables stored in one column | "age_sex" = "30M" |
| Variables in both rows and columns | min and max temps as rows, dates as columns |
| Multiple types of observational units in one table | song info repeated for each play |
| A single observational unit stored in multiple tables | one file per year for the same kind of record |

Almost any real-world messiness fits one of these.
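
As an illustration of the "multiple variables stored in one column" pattern, here is a small base-R sketch. The data frame messy, its id column, and its values are hypothetical; the age_sex format ("30M") is the one used in the table above.

```r
# Hypothetical messy column: age and sex packed into one string.
messy <- data.frame(
  id      = 1:3,
  age_sex = c("30M", "45F", "27F")
)

# Split into two proper variables:
# the leading digits are the age, the trailing letters are the sex.
clean <- data.frame(
  id  = messy$id,
  age = as.integer(sub("[A-Za-z]+$", "", messy$age_sex)),
  sex = sub("^[0-9]+", "", messy$age_sex)
)

print(clean)
```

After the split, age can be averaged and sex can be used as a grouping variable, neither of which is possible while both are packed into one string.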

We'll meet the actual reshaping tools (pivot_longer, pivot_wider) on the "Reshaping Data" page later. For now, the goal is just to recognize tidy versus messy.

A side-by-side comparison

The same idea — average population per city across years — written two ways:


Both produce the right answer. But the tidy version uses a general grouped operation (aggregate(pop ~ city, ...)) that scales to any number of years; the wide version has to enumerate the year columns explicitly. Add a y2024 column and the wide solution must be edited to include it, otherwise it silently ignores the new year; the tidy solution works unchanged.
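
A hedged sketch of what the two versions might look like in base R. The data values are invented, and the wide computation via rowMeans() is one plausible way to write it, not necessarily the one used in the original code.

```r
# Illustrative data (invented values), in both shapes.
tidy <- data.frame(
  city = c("Avalon", "Avalon", "Brook", "Brook"),
  year = c(2022, 2023, 2022, 2023),
  pop  = c(1000, 1100, 2000, 2050)
)
wide <- data.frame(
  city  = c("Avalon", "Brook"),
  y2022 = c(1000, 2000),
  y2023 = c(1100, 2050)
)

# Tidy: one general grouped summary, independent of how many years exist.
tidy_means <- aggregate(pop ~ city, data = tidy, FUN = mean)

# Wide: the year columns have to be listed by name.
wide_means <- data.frame(
  city = wide$city,
  pop  = unname(rowMeans(wide[, c("y2022", "y2023")]))
)

print(tidy_means)
print(wide_means)
```

The hard-coded c("y2022", "y2023") is the fragile part: it has to change every time a year is added.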

"Observational unit" — what to put in one table

Rule 3 says: each type of observation gets its own table. Concrete example: imagine a music library where you record every time someone plays a song.

A bad table might combine song info and play info:

| play_id | played_at | song_title | album | artist | duration_sec |
|---|---|---|---|---|---|
| 1 | 2026-05-01 09:00 | Yesterday | Help! | The Beatles | 125 |
| 2 | 2026-05-01 09:02 | Yesterday | Help! | The Beatles | 125 |

The song title, album, artist, and duration are repeated for every play. That's wasteful and dangerous — fix a typo in one row and now two rows disagree.

The tidy approach: split into two tables.

  • songs: one row per song, columns: song_id, title, album, artist, duration_sec
  • plays: one row per play, columns: play_id, played_at, song_id

Then join them when you need to. The "one table per observational unit" rule prevents huge classes of data inconsistency.
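
A minimal base-R sketch of the split and the join, using the column names listed above. The song_id value 1 is an assumption; the other values come from the example table.

```r
# One row per song. The id value 1 is assumed for illustration.
songs <- data.frame(
  song_id      = 1L,
  title        = "Yesterday",
  album        = "Help!",
  artist       = "The Beatles",
  duration_sec = 125L
)

# One row per play; song details are referenced by song_id, not repeated.
plays <- data.frame(
  play_id   = c(1L, 2L),
  played_at = c("2026-05-01 09:00", "2026-05-01 09:02"),
  song_id   = c(1L, 1L)
)

# Join when the combined view is needed.
combined <- merge(plays, songs, by = "song_id")
print(combined)
```

A typo in the title now has to be fixed in exactly one place, the songs table, and every joined row picks up the correction.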

Why tidy data is especially powerful with ggplot2

Here is a concrete payoff. ggplot2 is the visualization library we'll spend a whole section on later. Its central idea is to map columns to visual properties. That mapping only works when data is tidy.

Want a line chart of population over time, colored by city?

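library(ggplot2)  # requires the ggplot2 package
# tidy: long-format data frame with columns city, year, pop
ggplot(tidy, aes(x = year, y = pop, color = city)) +
  geom_line()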

That single expression works because year is a column, pop is a column, and city is a column. Try doing the same with the wide form — you can't, without reshaping first.
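
To make "reshaping first" concrete, here is a rough base-R sketch that stacks the year columns by hand. The data values are invented, and this manual approach is only for illustration; dedicated reshaping functions are the usual route.

```r
# Illustrative wide data (invented values).
wide <- data.frame(
  city  = c("Avalon", "Brook"),
  y2022 = c(1000, 2000),
  y2023 = c(1100, 2050)
)

year_cols <- c("y2022", "y2023")

# Stack the year columns: one block of rows per year column.
tidy <- data.frame(
  city = rep(wide$city, times = length(year_cols)),
  year = rep(as.integer(sub("^y", "", year_cols)), each = nrow(wide)),
  pop  = unlist(wide[year_cols], use.names = FALSE)
)

# Sort for readability; row order does not affect tidiness.
tidy <- tidy[order(tidy$city, tidy$year), ]
print(tidy)
```

The result has year, pop, and city as ordinary columns, which is exactly what the aes() mapping above needs.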

Two acid tests

Whenever you're unsure if your data is tidy, ask yourself:

  1. Could I compute a new summary by saying "group by X and summarize Y"? If yes, you're in tidy territory. If X is hidden in column names or Y is spread across several columns, you're not.
  2. Would adding a new value (a new year, a new product, a new patient) require adding a new row, not a new column? If it requires a new row, you're tidy. If it requires a new column, you're wide.

If your data fails either test, plan to reshape it before going further.
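
Both tests can be tried on a small example. The sketch below uses invented values and base R only; the 2024 figures are hypothetical.

```r
# Illustrative tidy data (invented values).
tidy <- data.frame(
  city = c("Avalon", "Avalon", "Brook", "Brook"),
  year = c(2022, 2023, 2022, 2023),
  pop  = c(1000, 1100, 2000, 2050)
)

# Test 1: "group by city and summarize pop" is a single call.
means1 <- aggregate(pop ~ city, data = tidy, FUN = mean)

# Test 2: a new year arrives as new rows; the columns stay the same.
new_year <- data.frame(
  city = c("Avalon", "Brook"),
  year = c(2024, 2024),
  pop  = c(1200, 2100)
)
tidy2 <- rbind(tidy, new_year)

# The same summary call works unchanged on the extended data.
means2 <- aggregate(pop ~ city, data = tidy2, FUN = mean)
print(means1)
print(means2)
```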

Test your understanding

Question (select one)

A dataset is tidy when:

  • Each variable is a column, each observation is a row, and each observational unit is its own table.
  • Each value is rounded to two decimals.
  • Missing values have been removed.
  • The data has been sorted.

Question (select one)

A spreadsheet has columns: country, gdp_2020, gdp_2021, gdp_2022. Is it tidy?

  • Yes: every row is one country.
  • No: "year" is hidden in the column names, so year is not a variable in its own right.
  • Yes: every column has a name.
  • It depends on how many countries there are.

Question (select one)

Why does tidy data make analysis easier?

  • It's mandatory in R.
  • It makes the file smaller.
  • Modern tools (dplyr, ggplot2, modeling functions) all assume one row per observation and one column per variable, so tidy data "just works" with them.
  • It removes missing values.

Mini challenge: spot the tidy version

You'll receive two data frames representing the same information. Assign tidy_df to whichever one is tidy.

Challenge: pick the tidy one

Two data frames a and b are provided. They contain the same data, but one is tidy and the other is wide. Set tidy_df to the tidy one.

We now know what tidy data is and why it matters. The next section introduces dplyr, the core of the modern R toolkit for working with tidy data.
