Tidy Data Principles
One of the most useful organizing ideas in modern data analysis: a simple, three-rule recipe for shaping data so that most tools can work with it directly.
In 2014, Hadley Wickham wrote a paper that quietly changed how a whole generation of analysts works. It introduced a deceptively simple idea called tidy data.
A dataset is tidy when:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
That's it. Three rules. They sound obvious. They are not — most real-world data violates them.
The payoff: once your data is tidy, the core tidyverse tools
work with it directly. dplyr, ggplot2, tidyr, and most statistical
modeling functions are designed around tidy input. Data preparation
is often said to take up to 80% of an analysis, and getting the
data tidy is a large part of that work.
A tidy example
Every column is a variable (city, year, population). Every row is an observation (a city in a year). This is the form everything downstream wants.
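As a minimal sketch, a table of this shape can be built in base R as follows. The city names and population figures are invented for illustration, and the names tidy and pop are assumptions chosen to match the snippets used later on this page.

```r
# Invented illustrative values; these are not real population figures.
tidy <- data.frame(
  city = c("Avalon", "Avalon", "Brightwater", "Brightwater"),
  year = c(2022, 2023, 2022, 2023),
  pop  = c(100000, 102000, 250000, 255000),
  stringsAsFactors = FALSE
)

print(tidy)

# One column per variable, one row per city-year observation.
names(tidy)
nrow(tidy)
```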
What "messy" looks like
Here is the same data, but wide — years spread across columns:
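A minimal sketch of a wide layout in base R, with invented values. The column names y2022 and y2023 follow the ones discussed next; the data frame name wide is an assumption.

```r
# Invented illustrative values; one row per city, one column per year.
wide <- data.frame(
  city  = c("Avalon", "Brightwater"),
  y2022 = c(100000, 250000),
  y2023 = c(102000, 255000),
  stringsAsFactors = FALSE
)

print(wide)

# The year values exist only inside the column names.
names(wide)
"year" %in% names(wide)
```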
What's wrong with this? Two things:
- "Year" is a variable, but it's hidden in the column names y2022 and y2023. Any operation that refers to year (filtering, grouping, plotting over time) has nothing to work with, because there is no year column.
- The same variable (population) lives in multiple columns. This makes vectorized operations awkward and plotting difficult without reshaping first.
Wide data is fine for display (humans like compact tables). But for computation, you almost always want it tidy / long.
The four messy patterns
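The original "Tidy Data" paper identified five common ways data is messy. Four of them cover most of what you will meet in practice (the fifth, a single observational unit split across several tables, is the reverse of the last row below):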
| Pattern | Problem |
|---|---|
| Column headers are values, not variable names | e.g. y2022, y2023 columns |
| Multiple variables stored in one column | e.g. "age_sex" = "30M" |
| Variables in both rows and columns | e.g. min and max temps as rows, dates as columns |
| Multiple types of observational units in one table | e.g. song info repeated for each play |
Almost any real-world messiness fits one of these.
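To make the second pattern concrete, here is a minimal base R sketch that separates a combined age_sex value such as "30M" into two columns. The data frame and every value other than "30M" are invented, and the regular expressions assume the format is always digits followed by letters.

```r
# Hypothetical messy input: age and sex packed into one column.
messy <- data.frame(
  id      = 1:3,
  age_sex = c("30M", "45F", "27F"),
  stringsAsFactors = FALSE
)

# Strip the trailing letters to get age; strip the leading digits to get sex.
split_out <- data.frame(
  id  = messy$id,
  age = as.integer(sub("[A-Za-z]+$", "", messy$age_sex)),
  sex = sub("^[0-9]+", "", messy$age_sex),
  stringsAsFactors = FALSE
)

print(split_out)
```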
We'll meet the actual reshaping tools (pivot_longer,
pivot_wider) on the "Reshaping Data" page later. For now, the
goal is just to recognize tidy versus messy.
A side-by-side comparison
The same idea — average population per city across years — written two ways:
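A sketch of the two versions in base R, assuming small data frames named tidy and wide that hold invented values. The wide version uses rowMeans() as one plausible way to average across explicitly listed year columns.

```r
# Invented illustrative values in both layouts.
tidy <- data.frame(
  city = c("Avalon", "Avalon", "Brightwater", "Brightwater"),
  year = c(2022, 2023, 2022, 2023),
  pop  = c(100000, 102000, 250000, 255000),
  stringsAsFactors = FALSE
)
wide <- data.frame(
  city  = c("Avalon", "Brightwater"),
  y2022 = c(100000, 250000),
  y2023 = c(102000, 255000),
  stringsAsFactors = FALSE
)

# Tidy version: group by city, summarize pop.
tidy_means <- aggregate(pop ~ city, data = tidy, FUN = mean)

# Wide version: the year columns must be listed by name.
wide_means <- data.frame(
  city = wide$city,
  pop  = rowMeans(wide[, c("y2022", "y2023")]),
  stringsAsFactors = FALSE
)

print(tidy_means)
print(wide_means)
```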
Both produce the right answer. But the tidy version uses a
general operation (aggregate() with the formula pop ~ city) that
scales to any number of years; the wide version requires explicitly
enumerating the year columns. Add a y2024 column, and the wide
solution has to be edited to include it; the tidy solution works
unchanged.
"Observational unit" — what to put in one table
Rule 3 says: each type of observation gets its own table. Concrete example: imagine a music library where you record every time someone plays a song.
A bad table might combine song info and play info:
| play_id | played_at | song_title | album | artist | duration_sec |
|---|---|---|---|---|---|
| 1 | 2026-05-01 09:00 | Yesterday | Help! | The Beatles | 125 |
| 2 | 2026-05-01 09:02 | Yesterday | Help! | The Beatles | 125 |
The song title, album, artist, and duration are repeated for every play. That's wasteful and dangerous — fix a typo in one row and now two rows disagree.
The tidy approach: split into two tables.
- songs: one row per song, columns: song_id, title, album, artist, duration_sec
- plays: one row per play, columns: play_id, played_at, song_id
Then join them when you need to. The "one table per observational unit" rule prevents huge classes of data inconsistency.
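The split and the join can be sketched in base R with merge(). The rows come from the example table above; the song_id value 1 is a hypothetical key introduced for illustration.

```r
# One row per song. song_id = 1 is a hypothetical key.
songs <- data.frame(
  song_id      = 1,
  title        = "Yesterday",
  album        = "Help!",
  artist       = "The Beatles",
  duration_sec = 125,
  stringsAsFactors = FALSE
)

# One row per play; song details are referenced by key, not repeated.
plays <- data.frame(
  play_id   = c(1, 2),
  played_at = c("2026-05-01 09:00", "2026-05-01 09:02"),
  song_id   = c(1, 1),
  stringsAsFactors = FALSE
)

# Join on the shared key only when the combined view is needed.
joined <- merge(plays, songs, by = "song_id")
print(joined)
```

Because the title is stored once in songs, correcting a typo there changes every joined row at the same time.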
Why tidy data is especially powerful with ggplot2
Here is a concrete payoff. ggplot2 is the visualization library
we'll spend a whole section on later. Its central idea is to map
columns to visual properties. That mapping only works when
data is tidy.
Want a line chart of population over time, colored by city?

    library(ggplot2)
    ggplot(tidy, aes(x = year, y = pop, color = city)) +
      geom_line()

That single expression works because year is a column, pop
is a column, and city is a column. Try doing the same with the
wide form: you can't, without reshaping first.
Two acid tests
Whenever you're unsure if your data is tidy, ask yourself:
- Could I compute a new summary by saying "group by X and summarize Y"? If yes, you're in tidy territory. If X is hidden in column names or Y is spread across several columns, you're not.
- Would adding a new value (a new year, a new product, a new patient) require adding a new row, not a new column? If it requires a new row, you're tidy. If it requires a new column, you're wide.
If your data fails either test, plan to reshape it before going further.
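The second test can be sketched in base R by adding a 2024 measurement to each layout. All values are invented, and the data frame names tidy and wide are assumptions.

```r
# Invented illustrative values in both layouts.
tidy <- data.frame(
  city = c("Avalon", "Avalon", "Brightwater", "Brightwater"),
  year = c(2022, 2023, 2022, 2023),
  pop  = c(100000, 102000, 250000, 255000),
  stringsAsFactors = FALSE
)
wide <- data.frame(
  city  = c("Avalon", "Brightwater"),
  y2022 = c(100000, 250000),
  y2023 = c(102000, 255000),
  stringsAsFactors = FALSE
)

# Tidy: a new year means new rows; the columns stay the same.
tidy_2024 <- rbind(
  tidy,
  data.frame(
    city = c("Avalon", "Brightwater"),
    year = 2024,
    pop  = c(104000, 260000),
    stringsAsFactors = FALSE
  )
)

# Wide: a new year means a new column; the rows stay the same.
wide_2024 <- wide
wide_2024$y2024 <- c(104000, 260000)

dim(tidy_2024)  # rows grew, columns did not
dim(wide_2024)  # columns grew, rows did not

# The same group-and-summarize call picks up 2024 with no code changes.
means_2024 <- aggregate(pop ~ city, data = tidy_2024, FUN = mean)
print(means_2024)
```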
Test your understanding
A dataset is tidy when:
- Each variable is a column, each observation is a row, and each observational unit is its own table.
- Each value is rounded to two decimals.
- Missing values have been removed.
- The data has been sorted.
A spreadsheet has columns: country, gdp_2020, gdp_2021, gdp_2022. Is it tidy?
- Yes, every row is one country.
- No, "year" is hidden in the column names, so year is not a variable in its own right.
- Yes, every column has a name.
- It depends on how many countries there are.
Why does tidy data make analysis easier?
- It's mandatory in R.
- It makes the file smaller.
- Modern tools (dplyr, ggplot2, modeling functions) all assume one row per observation and one column per variable, so tidy data "just works" with them.
- It removes missing values.
Mini challenge: spot the tidy version
Two data frames, a and b, are provided. They contain the same information, but one is tidy and the other is wide. Assign tidy_df to the tidy one.
We now know what tidy data is and why it matters. The next
section introduces the modern R toolkit for getting and
keeping your data tidy: dplyr.