The Data Layer
Why ggplot2 expects tidy data frames, how the shape of your data determines what you can map, and how to reshape data so ggplot2 can see it.
Every ggplot starts with data. It is the first argument to
ggplot() and the foundation every other component builds on. Getting
the data into the right shape is half the battle of plotting — and
the half beginners most often skip.
ggplot2 wants a data frame
ggplot2 plots data frames (or tibbles): a rectangle where each row is one observation and each column is one variable. The plot's mappings then refer to columns by name.
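As a quick illustration, the snippet below peeks at mpg, the example dataset that ships with ggplot2. It is only a sketch of how you might inspect the rectangle before plotting; any data frame would do.

```r
library(ggplot2)

# mpg ships with ggplot2: each row is one car model, each column one variable
head(mpg)

# The column names are what mappings will refer to
names(mpg)
```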
You can map any column to an aesthetic. In the mpg dataset that ships
with ggplot2, for example, you might map displ to x, hwy to y, class
to color, and so on. The set of columns is the menu of things you are
allowed to map.
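A minimal sketch of those three mappings, assuming a scatterplot (geom_point()) is the geom you want. The geom choice is an assumption for illustration; the mappings themselves come straight from the column names.

```r
library(ggplot2)

# Each aesthetic on the left is fed by an mpg column on the right
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

p
```

Swapping class for another column name changes the legend without touching the rest of the plot.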
Tidy data: the shape ggplot2 loves
ggplot2 works best with tidy data, where:
- Each variable is its own column.
- Each observation is its own row.
This matters because mappings refer to columns. If the variable you want to put on the x-axis is not a column, you cannot map it.
A concrete reshaping example
Suppose sales are stored with one column per year. That layout is convenient for a spreadsheet but awkward for ggplot2, because "year" is not a column; it is spread across the column names.
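The sketch below builds a small wide table of that kind and reshapes it with base R's reshape(). The region column and all sales figures are made up for illustration; only the y2019, y2020, y2021 column layout reflects the situation described here.

```r
# Hypothetical wide data: one column per year (values are invented)
sales_wide <- data.frame(
  region = c("North", "South"),
  y2019  = c(100, 80),
  y2020  = c(120, 90),
  y2021  = c(150, 85)
)

# Gather the three year columns into two columns: year and sales
sales_long <- reshape(
  sales_wide,
  direction = "long",
  varying   = c("y2019", "y2020", "y2021"),
  v.names   = "sales",
  timevar   = "year",
  times     = c(2019, 2020, 2021),
  idvar     = "region"
)
rownames(sales_long) <- NULL  # drop the composite row names reshape() adds

sales_long
```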
After reshaping to long form, year and sales are real columns, so they can be mapped to aesthetics.
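A minimal sketch of such a plot, assuming a long-form table with hypothetical region, year, and sales columns (the values are invented, and the table is built directly here so the example is self-contained):

```r
library(ggplot2)

# Hypothetical long-form sales data: one row per region and year
sales_long <- data.frame(
  region = rep(c("North", "South"), times = 3),
  year   = rep(c(2019, 2020, 2021), each = 2),
  sales  = c(100, 80, 120, 90, 150, 85)
)

# year and sales are ordinary columns, so they can go straight into aes()
p <- ggplot(sales_long, aes(x = year, y = sales, color = region)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = c(2019, 2020, 2021))  # one tick per year

p
```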
The rule of thumb
If a variable you want to put on an axis or in a legend is not a
single column, reshape the data first. In real projects you would
typically use tidyr::pivot_longer(); the idea is the same as with base
R's reshape(): move information out of column names and into column
values.
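For comparison, here is a sketch of the same kind of reshape with tidyr::pivot_longer(), assuming the tidyr package is installed. The sales_wide table, its region column, and its values are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one column per year (values are invented)
sales_wide <- data.frame(
  region = c("North", "South"),
  y2019  = c(100, 80),
  y2020  = c(120, 90),
  y2021  = c(150, 85)
)

sales_long <- pivot_longer(
  sales_wide,
  cols            = c(y2019, y2020, y2021),   # columns whose names hold the year
  names_to        = "year",                   # new column for the former names
  names_prefix    = "y",                      # strip the leading "y"
  names_transform = list(year = as.integer),  # store the year as a number
  values_to       = "sales"                   # new column for the cell values
)

sales_long
```

Converting year to an integer is a design choice: it keeps the x-axis continuous. Leaving it as a character column would give a discrete axis instead.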
Continuous vs. categorical columns
ggplot2 treats columns differently based on their type, and this ripples through scales, legends, and even which geoms make sense:
| Column type in R | ggplot2 treats it as | Typical use |
|---|---|---|
| numeric / integer | continuous | x/y position, size, smooth color gradient |
| factor / character | discrete (categorical) | grouping, discrete color, separate bars/facets |
This is why factor() shows up so often in ggplot code. In mpg,
cyl (cylinders) is stored as a number, so ggplot2 gives it a
continuous color gradient. Wrap it in factor() and it becomes
categorical with one distinct color per value:
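The two variants can be sketched side by side as follows. The choice of displ and hwy for the axes and of geom_point() is an assumption for illustration; the only difference between the plots is the factor() call.

```r
library(ggplot2)

# cyl is stored as a number, so color uses a continuous gradient
p_continuous <- ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point()

# factor(cyl) is categorical, so each cylinder count gets its own color
p_discrete <- ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()

p_continuous
p_discrete
```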
Same column, same mapping target (color), but a completely different legend and palette — because the type changed from continuous to discrete. Keep this in mind; it explains a surprising number of "why does my plot look wrong?" moments.
Your sales data has separate columns y2019, y2020, y2021, and you want year on the x-axis. Why must you reshape it first?
- ggplot2 cannot plot numbers larger than 2018.
- ggplot2 requires every data frame to have exactly three columns.
- Mappings refer to columns, and "year" currently lives in the column names rather than in a single column of values.
- You must convert all the values to characters first.
In mpg, cyl is stored as a number. Mapping color = cyl gives a smooth color gradient, but color = factor(cyl) gives distinct colors per value. Why?
- factor() changes the data values themselves.
- ggplot2 picks colors at random each run.
- A numeric column is treated as continuous (gradient scale), while a factor is treated as discrete (one color per category).
- Gradients are only allowed for the x-axis.
Key takeaways
- ggplot2 plots data frames: rows are observations, columns are variables, and mappings refer to columns by name.
- Prefer tidy/long data — if a variable is hiding in column names, reshape it into a column before plotting.
- A column's type matters: numeric → continuous scales, factor / character → discrete scales. Use factor() to switch a number to categorical treatment.