Grouped Analysis

The single most powerful idea in data analysis: split your data into groups, compute something per group, and combine the results. `group_by()` paired with `summarise()` turns that into a two-line pipeline.

Almost every interesting analytical question is a "per-group" question:

  • Average sales per region.
  • Median salary per department.
  • Maximum temperature per month.
  • Win rate per team.
  • Conversion rate per campaign.

The pattern is universal and has a name: split-apply-combine.
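To make the three steps concrete, here is a minimal base R sketch that performs each one by hand. It assumes the built-in mtcars data set and its mpg and cyl columns; the variable names are illustrative only.

```r
# Split: one vector of mpg values per distinct value of cyl
groups <- split(mtcars$mpg, mtcars$cyl)

# Apply: compute the mean within each group
group_means <- lapply(groups, mean)

# Combine: collect the per-group results into one named numeric vector
result <- unlist(group_means)
result
```

Each step is a separate line here, which is exactly the bookkeeping that a grouped pipeline hides.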

In dplyr, you do it in two lines: group_by() + summarise().

The basic pattern

Read it: "Group the data by cyl. Within each group, compute the mean of mpg. Combine."

Without group_by(), summarise() returns one row total. With group_by(cyl), it returns one row per distinct cyl value. That one extra line changes the shape of the whole result.
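The pipeline read aloud above can be sketched as follows. This assumes the built-in mtcars data set (the text names only the cyl and mpg columns); avg_mpg, overall, and by_cyl are illustrative names.

```r
library(dplyr)

# Without group_by(): a single row summarising the whole data set
overall <- mtcars |>
  summarise(avg_mpg = mean(mpg))

# With group_by(cyl): one row per distinct value of cyl
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg))

overall
by_cyl
```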

Multiple summaries at once

You can compute many summary columns in a single summarise() call:

n() is a special dplyr helper meaning "how many rows in this group." You will use it constantly.
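A minimal sketch of several summaries in one call, again assuming mtcars. The output column names (n_cars, avg_mpg, max_hp, med_wt) are made up for illustration.

```r
library(dplyr)

cyl_summary <- mtcars |>
  group_by(cyl) |>
  summarise(
    n_cars  = n(),         # number of rows in the current group
    avg_mpg = mean(mpg),   # mean fuel economy
    max_hp  = max(hp),     # largest horsepower
    med_wt  = median(wt)   # median weight
  )

cyl_summary
```

The result has one row per cyl value and one column per summary, plus the grouping column itself.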

Multiple grouping variables

Group by more than one column to get one row per combination of values that actually occurs in the data:

The .groups = "drop" argument tells summarise() to return an ungrouped result. Without it, summarise() drops only the last grouping variable, so the result stays grouped by the remaining ones, and dplyr prints a friendly message saying so. Either way, it's good practice to be explicit.
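A minimal sketch of grouping by two columns, assuming mtcars with cyl and gear as the grouping variables (the choice of gear is an assumption, not taken from the page).

```r
library(dplyr)

by_cyl_gear <- mtcars |>
  group_by(cyl, gear) |>
  summarise(
    avg_mpg = mean(mpg),
    n       = n(),
    .groups = "drop"   # return a plain, ungrouped result
  )

by_cyl_gear
is_grouped_df(by_cyl_gear)  # FALSE: the result carries no grouping
```

Only the cyl/gear pairs present in the data get a row; pairs that never occur are simply absent.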

Counting groups: count()

Counting how many rows are in each group is so common that there's a shortcut: count(). Calling count(x) is roughly equivalent to group_by(x) |> summarise(n = n()).

count() is your go-to for the question "what categories exist, and how many rows of each?"
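The equivalence can be checked side by side. A minimal sketch, assuming mtcars and cyl as the column being counted:

```r
library(dplyr)

# Long form: group, then count rows per group
long_form <- mtcars |>
  group_by(cyl) |>
  summarise(n = n())

# Short form: count() does the grouping and counting in one step
short_form <- mtcars |>
  count(cyl)

long_form
short_form
```

Both produce the same cyl and n columns; only the amount of typing differs.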

Grouped mutate(): per-group calculations that keep the rows

group_by() doesn't only work with summarise(). With mutate(), it computes the new column within each group, but keeps every row.

This pattern — "compute a per-group statistic, then express each row relative to it" — is extremely common. It's how you normalize, rank within groups, compute z-scores per category, etc.

ungroup() is the inverse of group_by(). After you're done with grouped work, ungroup so that later verbs don't silently keep operating per group.
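A minimal sketch of the "relative to the group" pattern, assuming mtcars. The new column names (avg_mpg_cyl, mpg_vs_avg) are illustrative.

```r
library(dplyr)

relative_mpg <- mtcars |>
  group_by(cyl) |>
  mutate(
    avg_mpg_cyl = mean(mpg),          # group mean, repeated on every row of the group
    mpg_vs_avg  = mpg - avg_mpg_cyl   # each car relative to its own group's mean
  ) |>
  ungroup()                           # drop the grouping before further work

nrow(relative_mpg)  # same number of rows as mtcars
```

Unlike summarise(), nothing is collapsed: every original row survives, with the group statistic attached to it.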

Slicing: top N per group

A frequent question: "show me the top N items per group." The verb is slice_max() (and friends).

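slice_max(col, n = k) keeps the k rows with the largest values of col in each group. The n argument must be passed by name, and ties can return extra rows unless you set with_ties = FALSE. Siblings: slice_min(), slice_head(), slice_tail(), slice_sample().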
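A minimal sketch of "top 2 per group", assuming mtcars, with mpg as the ranking column and cyl as the group. Setting with_ties = FALSE is a choice made here so that each group returns exactly two rows.

```r
library(dplyr)

top2 <- mtcars |>
  group_by(cyl) |>
  slice_max(mpg, n = 2, with_ties = FALSE) |>  # n must be named
  ungroup()

top2 |>
  select(cyl, mpg, hp)
```

With the default with_ties = TRUE, a group whose second-best value is shared by several rows would return all of them.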

A realistic example: iris by species

iris has 150 flowers, 50 of each of three species. Per-species summary statistics are the natural question:

One row per species, all summaries on the same line.
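A minimal sketch of such a per-species summary. Which statistics the page originally computed is not preserved, so the columns below are assumptions chosen for illustration.

```r
library(dplyr)

species_summary <- iris |>
  group_by(Species) |>
  summarise(
    n                 = n(),                 # flowers per species
    mean_sepal_length = mean(Sepal.Length),
    mean_petal_length = mean(Petal.Length),
    sd_petal_length   = sd(Petal.Length)
  )

species_summary
```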

Multi-file challenge: monthly air quality

Let's apply this to airquality. We'll structure the analysis across two files: one helper file with reusable functions, and one main script that computes a monthly summary.

Challenge
Monthly summary of NYC air quality

Complete main.R so it produces a data frame named monthly with one row per month and these columns:

  • Month
  • avg_ozone (mean of Ozone, NAs removed)
  • avg_temp (mean of Temp)
  • n_obs (number of rows in that month)

You can use the helper mean_na() from utils.R.
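The body of mean_na() in utils.R is not shown on this page. A plausible sketch, assuming it simply wraps mean() with missing values removed (this definition is a guess, not the actual helper):

```r
# Hypothetical version of the helper in utils.R; the real one is not shown
mean_na <- function(x) {
  mean(x, na.rm = TRUE)
}

mean(c(1, NA, 3))     # NA: a single missing value poisons the plain mean
mean_na(c(1, NA, 3))  # 2: the NA is dropped before averaging
```

This matters for the challenge because the Ozone column of airquality contains missing values, so a plain mean() would return NA for the affected months.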

Test your understanding

Question (select one)

What's the universal pattern that group_by() |> summarise() implements?

Filter-arrange-join

Split-apply-combine

Map-reduce-filter

Sort-merge-deduplicate

Question (select one)

In a dplyr pipeline, what does n() do?

Returns the number of columns.

Returns the number of rows in the current (possibly grouped) data.

Returns the mean of the data.

Returns the row name.

Question (select one)

What's the difference between group_by(cyl) |> summarise(avg = mean(mpg)) and group_by(cyl) |> mutate(avg = mean(mpg))?

They're the same.

summarise is slower than mutate.

summarise returns one row per group (collapses); mutate keeps every row and broadcasts the group statistic into each row.

mutate cannot be used with group_by.

The last page in this section covers reshaping — switching between wide and long forms when your data isn't yet tidy.
