Every Geom Has a Stat
The hidden statistical transformation behind every layer — why geom_bar can draw a chart from a single column, and how stats and geoms pair up.
This is one of the most under-appreciated ideas in ggplot2, and grasping it will make a lot of "magic" behavior suddenly make sense. Every layer runs a statistical transformation before it draws. Sometimes that transformation does nothing; sometimes it does a lot.
The mystery of geom_bar
Recall that geom_bar() needs only an x aesthetic, yet it produces
bars with heights. Where do the heights come from? You never supplied
a y.
The answer: geom_bar() has a statistic attached, stat_count, that
runs first. It groups the rows by the x variable (class, in this
example), counts the rows in each group, and produces a new computed
variable, count. That count becomes the bar height. The geom draws
bars; the stat invented the y-values.
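To see the stat at work, here is a minimal sketch. It assumes the mpg dataset that ships with ggplot2 (the text only names the class column, so the dataset is an assumption) and uses layer_data() to inspect what the stat handed to the geom.

```r
library(ggplot2)

# Only x is mapped; stat_count supplies the bar heights.
p <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# layer_data() returns the data frame the stat passed on to the geom.
# The count column was computed by stat_count, and y is filled from it.
bar_data <- layer_data(p)
bar_data[, c("x", "count", "y")]

p
```

The count column should match what table(mpg$class) reports, which is a quick way to confirm that the heights are plain row counts.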
The layer pipeline
Every layer is really data → stat → geom. The stat sits between your raw data and the drawn marks.
For a scatter plot the stat is stat_identity — it passes the data
through unchanged, which is why geom_point() needs you to supply
both x and y. For a histogram the stat is stat_bin — it slices x into
bins and counts each. The geom is the same idea as always (draw
marks); the stat is what differs.
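The contrast between stat_identity and stat_bin can be sketched as follows, again assuming the mpg dataset from ggplot2. The binwidth of 5 is an arbitrary choice for illustration.

```r
library(ggplot2)

# stat_identity: one output row per input row, x and y passed through.
p_point <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# stat_bin: x is sliced into bins and the rows in each bin are counted.
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 5)

point_data <- layer_data(p_point)
hist_data <- layer_data(p_hist)

nrow(point_data)                        # same number of rows as mpg
hist_data[, c("xmin", "xmax", "count")] # one row per bin
```

The point layer keeps every row, while the histogram layer collapses them into a handful of bins whose counts add up to the number of rows in the data.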
Geoms and stats are paired — but separable
Each geom ships with a default stat, and each stat ships with a default geom. They are two sides of one layer:
| Layer | Default stat | What the stat computes |
|---|---|---|
| geom_point() | identity | nothing; uses x and y as given |
| geom_bar() | count | count of rows per x category |
| geom_histogram() | bin | counts within bins of a continuous x |
| geom_boxplot() | boxplot | quartiles, median, whiskers, outliers |
| geom_smooth() | smooth | a fitted line + confidence interval |
| geom_density() | density | a kernel density estimate |
Because they are separable, you can override the stat. Telling
geom_bar() to use stat = "identity" makes it stop counting and use
your y directly, which is exactly what geom_col() does under the
hood.
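A small sketch of that equivalence, using a hypothetical pre-summarised data frame (the column names class and n and the values are made up for illustration):

```r
library(ggplot2)

# Hypothetical summary table: the heights are already computed.
totals <- data.frame(
  class = c("compact", "pickup", "suv"),
  n     = c(47, 33, 62)
)

# Override the default stat so geom_bar() uses y as given.
p_bar <- ggplot(totals, aes(x = class, y = n)) +
  geom_bar(stat = "identity")

# geom_col() uses stat = "identity" by default.
p_col <- ggplot(totals, aes(x = class, y = n)) +
  geom_col()

bar_y <- layer_data(p_bar)$y
col_y <- layer_data(p_col)$y
```

Both layers end up with the same y values, taken straight from the n column, so the two plots look the same.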
Computed variables: the after-stat values
A stat produces new columns you did not have. stat_count produces
count and prop; stat_bin produces count and density, among others.
You can use these computed variables in mappings via after_stat().
For example, mapping y = after_stat(density) makes a histogram show
density instead of raw counts on the y-axis.
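A minimal sketch of that mapping, assuming the mpg dataset and an arbitrary binwidth of 2 (after_stat() needs ggplot2 3.3.0 or later):

```r
library(ggplot2)

# Map y to the computed density column instead of the default count.
p_density <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 2)

dens_data <- layer_data(p_density)

# Density is scaled so the bar areas sum to 1.
bar_area <- sum(dens_data$density * (dens_data$xmax - dens_data$xmin))

p_density
```

Because the y-axis now shows density, the bar areas rather than the bar heights carry the meaning, and they sum to 1.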
This is advanced, but the point is conceptual: the stat creates new
variables, and you can reach into them. The default histogram maps
y = after_stat(count) for you; here we asked for density instead.
Why separate stat from geom at all?
Because it makes the system combinatorial. Any stat can, in principle, pair with any geom. A "count" can be drawn as bars, as points, or as a line. Separating the computation from the drawing is the same move that separated mapping from scale — and it is why ggplot2 generalizes so far beyond a fixed menu.
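As a rough illustration of one stat feeding several geoms, the sketch below draws the same counts as bars, points, and a line. It assumes the mpg dataset; group = 1 is added to the line layer so the discrete categories are connected as a single series.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class))

# Same computation (count), three different drawings.
p_bars   <- base + geom_bar()
p_points <- base + geom_point(stat = "count")
p_line   <- base + geom_line(stat = "count", aes(group = 1))

# The stat output is the same in every case; only the marks differ.
counts_bars   <- layer_data(p_bars)$count
counts_points <- layer_data(p_points)$count
counts_line   <- layer_data(p_line)$count
```

Not every pairing produces a sensible picture, but the computation itself does not care which geom draws it.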
geom_bar() is given only an x aesthetic, yet it draws bars with heights. Where do the heights come from?
- ggplot2 picks random heights.
- You must have secretly mapped y.
- geom_bar()'s default statistic, stat_count, counts the rows in each x category and supplies that count as the bar height.
- The heights are always 1.
What is the relationship between geom_col() and geom_bar(stat = "identity")?
- They are unrelated; geom_col uses a special coordinate system.
- geom_col() counts rows while geom_bar(stat = "identity") does not.
- They are equivalent: both use stat = "identity", meaning the data passes through unchanged and the supplied y becomes the bar height.
- geom_bar(stat = "identity") is invalid syntax.
Why does ggplot2 separate the statistic from the geom within a layer?
- To make plots render more slowly but more accurately.
- Because each geom can only ever use one specific stat.
- So the computation (e.g. counting, binning, smoothing) is independent of the drawing (bars, points, lines), letting any stat pair with different geoms.
- Purely for historical reasons with no practical effect.
Key takeaways
- Every layer runs a stat before drawing: data → stat → geom. geom_point() uses stat_identity (no change); geom_bar() uses stat_count; geom_histogram() uses stat_bin; and so on.
- Stats create new computed variables (like count, density) that become aesthetics; reach them with after_stat().
- Stat and geom are separable: overriding the stat is how geom_col() equals geom_bar(stat = "identity").