Dataslope

The Data Layer

Why ggplot2 expects tidy data frames, how the shape of your data determines what you can map, and how to reshape data so ggplot2 can see it.

Every ggplot starts with data. It is the first argument to ggplot() and the foundation every other component builds on. Getting the data into the right shape is half the battle of plotting — and the half beginners most often skip.

ggplot2 wants a data frame

ggplot2 plots data frames (or tibbles): a rectangle where each row is one observation and each column is one variable. The plot's mappings then refer to columns by name.
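One way to see that rectangle is to inspect a data frame directly. The sketch below assumes the ggplot2 package is installed and uses its bundled mpg dataset, the same one this page refers to later.

```r
library(ggplot2)  # also provides the mpg example dataset

# mpg is a tibble: one row per car model, one column per variable
dim(mpg)       # number of rows, number of columns
names(mpg)     # the column names that mappings can refer to
head(mpg, 3)   # the first three observations
```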

In ggplot2's built-in mpg dataset, for example, you can map any column to an aesthetic: displ to x, hwy to y, class to color, and so on. The set of columns is the menu of things you are allowed to map.
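As a minimal sketch of those mappings, assuming ggplot2 is installed, the plot below maps three mpg columns by name. The choice of geom_point() is only for illustration.

```r
library(ggplot2)

# Each aesthetic refers to a column of mpg by name
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

p  # printing the object draws the plot
```

Replacing class with any other column name from mpg changes what the color legend shows. A name that is not a column of the data cannot be mapped this way.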

Tidy data: the shape ggplot2 loves

ggplot2 works best with tidy data, where:

  1. Each variable is its own column.
  2. Each observation is its own row.

This matters because mappings refer to columns. If the variable you want to put on the x-axis is not a column, you cannot map it.

A concrete reshaping example

Suppose sales are stored with one column per year — convenient for a spreadsheet, useless for ggplot2, because "year" is not a column, it is spread across column names.
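The reshaping step might look like the sketch below. The sales_wide data frame, its region column, and all of its values are made up for illustration. It uses base R's reshape() to stack the year columns into a year column and a sales column.

```r
# Hypothetical wide data: one column per year (values are made up)
sales_wide <- data.frame(
  region = c("North", "South"),
  y2019  = c(100, 80),
  y2020  = c(120, 90),
  y2021  = c(150, 95)
)

# Stack the three year columns into (year, sales) pairs
sales_long <- reshape(
  sales_wide,
  direction = "long",
  varying   = c("y2019", "y2020", "y2021"),  # columns to stack
  v.names   = "sales",                       # new column holding the values
  timevar   = "year",                        # new column holding the year
  times     = c(2019, 2020, 2021),           # year value for each stacked column
  idvar     = "region"                       # identifies each original row
)
rownames(sales_long) <- NULL  # drop the generated row names

sales_long  # six rows: one per region and year
```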

Once the data is in long form, year and sales are real columns, so they can be mapped.
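A sketch of such a plot follows, assuming ggplot2 is installed. To keep the example self-contained, the long-form sales_long data frame is built directly, and its region column and values are hypothetical.

```r
library(ggplot2)

# Hypothetical long-form sales data (values are made up)
sales_long <- data.frame(
  region = rep(c("North", "South"), times = 3),
  year   = rep(c(2019, 2020, 2021), each = 2),
  sales  = c(100, 80, 120, 90, 150, 95)
)

# year and sales are ordinary columns, so they can be mapped by name
p <- ggplot(sales_long, aes(x = year, y = sales, color = region)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(breaks = c(2019, 2020, 2021))  # one tick per year

p
```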

The rule of thumb

If a variable you want to put on an axis or in a legend is not a single column, reshape the data first. In real projects you would typically use tidyr::pivot_longer(), but the idea is the same as with base R's reshape(): move information out of column names and into column values.
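For comparison, here is a sketch of the same move with tidyr::pivot_longer(), assuming the tidyr package (version 1.1.0 or later) is installed. The sales_wide data frame and its values are hypothetical.

```r
library(tidyr)

# Hypothetical wide data: one column per year (values are made up)
sales_wide <- data.frame(
  region = c("North", "South"),
  y2019  = c(100, 80),
  y2020  = c(120, 90),
  y2021  = c(150, 95)
)

sales_long <- pivot_longer(
  sales_wide,
  cols            = c(y2019, y2020, y2021),   # columns whose names hold the year
  names_to        = "year",                   # column names become values of year
  names_prefix    = "y",                      # strip the leading "y"
  names_transform = list(year = as.integer),  # "2019" becomes 2019L
  values_to       = "sales"                   # cell values go into sales
)

sales_long
```

Converting year to an integer here means ggplot2 will treat it as continuous. Leaving it as a character column would give a discrete axis instead.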

Continuous vs. categorical columns

ggplot2 treats columns differently based on their type, and this ripples through scales, legends, and even which geoms make sense:

Column type in R      ggplot2 treats it as      Typical use
numeric / integer     continuous                x/y position, size, smooth color gradient
factor / character    discrete (categorical)    grouping, discrete color, separate bars/facets
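To find out which of the types listed above a column falls under, you can check its class before plotting. This sketch assumes ggplot2 is installed, since that is where the mpg dataset comes from. The variable name col_types is only illustrative.

```r
library(ggplot2)  # for the mpg dataset

# Storage class of three mpg columns
col_types <- sapply(mpg[c("displ", "cyl", "class")], class)
col_types
# displ is "numeric", cyl is "integer", class is "character"

# Wrapping a numeric column in factor() changes its class
class(factor(mpg$cyl))  # "factor"
```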

This is why factor() shows up so often in ggplot code. In mpg, cyl (cylinders) is stored as a number, so ggplot2 gives it a continuous color gradient. Wrap it in factor() and it becomes categorical, with one distinct color per value.
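The two cases can be sketched side by side, assuming ggplot2 is installed. The object names p_continuous and p_discrete are only illustrative.

```r
library(ggplot2)

# cyl as stored (a number): continuous color scale, drawn as a gradient colorbar
p_continuous <- ggplot(mpg, aes(x = displ, y = hwy, color = cyl)) +
  geom_point()

# cyl wrapped in factor(): discrete color scale, one legend key per value
p_discrete <- ggplot(mpg, aes(x = displ, y = hwy, color = factor(cyl))) +
  geom_point()

p_continuous
p_discrete
```

Note that factor(cyl) is applied inside aes(), so the mpg data frame itself is left unchanged.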

Same column, same mapping target (color), but a completely different legend and palette — because the type changed from continuous to discrete. Keep this in mind; it explains a surprising number of "why does my plot look wrong?" moments.

Question (select one)

Your sales data has separate columns y2019, y2020, y2021, and you want year on the x-axis. Why must you reshape it first?

  • ggplot2 cannot plot numbers larger than 2018.
  • ggplot2 requires every data frame to have exactly three columns.
  • Mappings refer to columns, and "year" currently lives in the column names rather than in a single column of values.
  • You must convert all the values to characters first.

Question (select one)

In mpg, cyl is stored as a number. Mapping color = cyl gives a smooth color gradient, but color = factor(cyl) gives distinct colors per value. Why?

  • factor() changes the data values themselves.
  • ggplot2 picks colors at random each run.
  • A numeric column is treated as continuous (gradient scale), while a factor is treated as discrete (one color per category).
  • Gradients are only allowed for the x-axis.

Key takeaways

  • ggplot2 plots data frames: rows are observations, columns are variables, and mappings refer to columns by name.
  • Prefer tidy/long data — if a variable is hiding in column names, reshape it into a column before plotting.
  • A column's type matters: numeric → continuous scales, factor / character → discrete scales. Use factor() to switch a number to categorical treatment.
