Grouped Analysis
The single most powerful idea in data analysis — split your data into groups, compute something per group, combine the results. `group_by()` makes it one line.
Almost every interesting analytical question is a "per-group" question:
- Average sales per region.
- Median salary per department.
- Maximum temperature per month.
- Win rate per team.
- Conversion rate per campaign.
The pattern is universal and has a name: split-apply-combine.
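To see the three steps spelled out before dplyr hides them, here is a rough base R sketch. It assumes R's built-in mtcars dataset; the names groups, means, and result are illustrative only.

```r
# Split: one vector of mpg values per cylinder count
groups <- split(mtcars$mpg, mtcars$cyl)

# Apply: compute the mean within each piece
means <- sapply(groups, mean)

# Combine: assemble the per-group results into one data frame
result <- data.frame(
  cyl = as.numeric(names(means)),
  avg_mpg = unname(means)
)

result
```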
In dplyr, you do it with two verbs: group_by() + summarise().
The basic pattern
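A minimal sketch of the pattern, assuming the built-in mtcars dataset (the output column name avg_mpg is just an illustrative choice):

```r
library(dplyr)

by_cyl <- mtcars |>
  group_by(cyl) |>                  # split: one group per cyl value
  summarise(avg_mpg = mean(mpg))    # apply + combine: one row per group

by_cyl
```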
Read it: "Group the data by cyl. Within each group, compute the
mean of mpg. Combine."
Without group_by(), summarise() returns one row total. With
group_by(cyl), it returns one row per cyl value. That tiny
change is enormous.
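You can check the row counts yourself. This sketch again assumes mtcars; overall and per_cyl are illustrative names.

```r
library(dplyr)

# No grouping: the whole table collapses to a single row
overall <- mtcars |>
  summarise(avg_mpg = mean(mpg))

# Grouped: the same summarise() call collapses each group separately
per_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg))

nrow(overall)
nrow(per_cyl)
```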
Multiple summaries at once
You can compute many summary columns in a single summarise()
call:
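For instance, a sketch along these lines (mtcars assumed; the summary column names are illustrative):

```r
library(dplyr)

cyl_stats <- mtcars |>
  group_by(cyl) |>
  summarise(
    n       = n(),        # rows in the group
    avg_mpg = mean(mpg),  # average fuel economy
    sd_mpg  = sd(mpg),    # spread of fuel economy
    max_hp  = max(hp)     # most powerful car in the group
  )

cyl_stats
```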
n() is a special dplyr helper meaning "how many rows in this
group." You will use it constantly.
Multiple grouping variables
Group by more than one column to get one row per combination of
values that actually occurs in the data:
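A small sketch, assuming mtcars and grouping by cyl and gear (the result name cyl_gear is illustrative):

```r
library(dplyr)

cyl_gear <- mtcars |>
  group_by(cyl, gear) |>
  summarise(
    n       = n(),
    avg_mpg = mean(mpg),
    .groups = "drop"      # return a plain, ungrouped result
  )

cyl_gear
```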
The .groups = "drop" argument tells summarise() to return an
ungrouped result. Without it, summarise() drops only the last
grouping variable, so the result is still grouped by the remaining
ones, and dplyr prints a message saying so. Being explicit means
later code is never surprised by leftover grouping.
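To see what is left over, you can inspect the result with group_vars(). This is a sketch on mtcars; still_grouped and dropped are illustrative names.

```r
library(dplyr)

# Default: only the last grouping variable (gear) is dropped, so the
# result is still grouped by cyl and dplyr prints a message about it.
still_grouped <- mtcars |>
  group_by(cyl, gear) |>
  summarise(avg_mpg = mean(mpg))

# Explicit: the result carries no grouping at all.
dropped <- mtcars |>
  group_by(cyl, gear) |>
  summarise(avg_mpg = mean(mpg), .groups = "drop")

group_vars(still_grouped)  # grouping variables left on the result
group_vars(dropped)        # none
```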
Counting groups: count()
Counting how many rows are in each group is so common that there's
a shortcut: count(). These two are equivalent:
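A sketch of the two forms side by side, assuming mtcars (long_form and short_form are illustrative names):

```r
library(dplyr)

# Long form: group, then count rows per group
long_form <- mtcars |>
  group_by(cyl) |>
  summarise(n = n())

# Shortcut: count() groups, counts, and ungroups in one step
short_form <- mtcars |>
  count(cyl)

long_form
short_form
```

Both produce the same cyl and n columns. Adding sort = TRUE to count() puts the largest groups first.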
count() is your go-to for the question "what categories exist,
and how many rows of each?"
Grouped mutate(): per-group calculations that keep the rows
group_by() doesn't only work with summarise(). With mutate(),
it computes the new column within each group, but keeps every
row.
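As a sketch (mtcars assumed; the new column names are illustrative), this adds each car's group average and its distance from that average:

```r
library(dplyr)

with_group_avg <- mtcars |>
  group_by(cyl) |>
  mutate(
    avg_mpg_cyl    = mean(mpg),         # same value repeated within a group
    mpg_vs_cyl_avg = mpg - avg_mpg_cyl  # each car relative to its own group
  ) |>
  ungroup()

nrow(with_group_avg)  # same number of rows as mtcars
```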
This pattern — "compute a per-group statistic, then express each row relative to it" — is extremely common. It's how you normalize, rank within groups, compute z-scores per category, etc.
ungroup() is the inverse of group_by(). After you're done
with grouped work, ungroup so later code isn't surprised.
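Here is a sketch of the kind of surprise ungroup() prevents, assuming mtcars (ranked, per_group, and whole_table are illustrative names):

```r
library(dplyr)

ranked <- mtcars |>
  group_by(cyl) |>
  mutate(mpg_rank = min_rank(desc(mpg)))  # rank 1 = best mpg within its cyl group

# The grouping is still attached, so this counts rows per cyl.
per_group <- ranked |>
  summarise(n = n())

# After ungroup(), the same call counts the whole table.
whole_table <- ranked |>
  ungroup() |>
  summarise(n = n())

nrow(per_group)
nrow(whole_table)
```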
Slicing: top N per group
A frequent question: "show me the top N items per group." The
verb is slice_max() (and friends).
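slice_max(col, n = N) keeps the N rows with the largest values of
col in each group. Ties are kept by default, so you can get more
than N rows unless you set with_ties = FALSE. Siblings:
slice_min(), slice_head(), slice_tail(), slice_sample().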
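A sketch of "top 2 per group", assuming mtcars and ranking by mpg within each cyl group (top2 is an illustrative name):

```r
library(dplyr)

top2 <- mtcars |>
  group_by(cyl) |>
  slice_max(mpg, n = 2, with_ties = FALSE) |>  # exactly 2 rows per group
  ungroup() |>
  select(cyl, mpg, hp)

top2
```

Note that n must be passed by name. Swapping slice_max() for slice_min() gives the bottom 2 per group instead.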
A realistic example: iris by species
iris has 150 flowers, 50 of each of three species. Per-species
summary statistics are the natural question:
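One possible version, using the built-in iris data (which summaries to compute, and their column names, are illustrative choices):

```r
library(dplyr)

species_stats <- iris |>
  group_by(Species) |>
  summarise(
    n                = n(),
    avg_sepal_length = mean(Sepal.Length),
    avg_petal_length = mean(Petal.Length),
    max_petal_width  = max(Petal.Width)
  )

species_stats
```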
One row per species, all summaries on the same line.
Multi-file challenge: monthly air quality
Let's apply this to airquality. We'll structure the analysis
across two files: one helper file with reusable functions, and
one main script that computes a monthly summary.
Complete main.R so it produces a data frame monthly with one row per month and these columns:
- Month
- avg_ozone (mean of Ozone, NAs removed)
- avg_temp (mean of Temp)
- n_obs (number of rows in that month)
You can use the helper mean_na() from utils.R.
Test your understanding
What's the universal pattern that group_by() |> summarise() implements?
- Filter-arrange-join
- Split-apply-combine
- Map-reduce-filter
- Sort-merge-deduplicate
In a dplyr pipeline, what does n() do?
- Returns the number of columns.
- Returns the number of rows in the current (possibly grouped) data.
- Returns the mean of the data.
- Returns the row name.
What's the difference between group_by(cyl) |> summarise(avg = mean(mpg)) and group_by(cyl) |> mutate(avg = mean(mpg))?
- They're the same.
- summarise is slower than mutate.
- summarise returns one row per group (collapses); mutate keeps every row and broadcasts the group statistic into each row.
- mutate cannot be used with group_by.
The last page in this section covers reshaping — switching between wide and long forms when your data isn't yet tidy.