The dplyr Verbs
Five small verbs — `filter`, `select`, `mutate`, `arrange`, `summarise` — plus the pipe operator. With these, you can express almost any tabular data manipulation in clear, readable English.
So far we've been doing data manipulation in base R: bracket
notation, $ access, order(), and so on. It works, but it gets
verbose fast. Compare:

    # base R
    sub <- mtcars[mtcars$mpg > 20 & mtcars$cyl == 4,
                  c("mpg", "wt", "hp")]
    sub <- sub[order(-sub$mpg), ]

vs.

    # dplyr
    library(dplyr)

    sub <- mtcars |>
      filter(mpg > 20, cyl == 4) |>
      select(mpg, wt, hp) |>
      arrange(desc(mpg))

The dplyr version reads like instructions: "take mtcars, filter to fuel-efficient four-cylinders, select these three columns, sort by mpg descending."
dplyr is built on five "verbs," each of which does exactly one thing.
The five core verbs
| Verb | What it does |
|---|---|
| `filter()` | keep rows that match a condition |
| `select()` | keep, drop, or reorder columns |
| `mutate()` | add or change columns |
| `arrange()` | sort rows |
| `summarise()` | collapse many rows into one summary row |
Plus one helper: group_by(), which makes summarise(),
mutate(), and friends operate per group. We'll meet that
properly on the next page.
You almost never use all five at once. But every dplyr pipeline is some composition of these.
The pipe: |>
The pipe takes the thing on its left and passes it as the first argument to the function on its right.

    mtcars |> head()    # same as head(mtcars)
    mtcars |> head(3)   # same as head(mtcars, 3)

This sounds trivial. It is not. It lets you write a sequence of transformations top-to-bottom, in reading order, instead of nesting parentheses inside out:

    # Without the pipe (read inside out)
    head(arrange(filter(mtcars, mpg > 20), desc(mpg)))

    # With the pipe (read top to bottom)
    mtcars |>
      filter(mpg > 20) |>
      arrange(desc(mpg)) |>
      head()

The second version reads like an English instruction sentence,
which is exactly the point.

R has two pipes: the native |> (added in R 4.1) and the older
%>% from the magrittr package (used widely with dplyr). They're
nearly equivalent for our purposes; we'll use |> throughout.
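
As a quick check of that equivalence, here is a minimal sketch that runs the same two steps through each pipe. It assumes dplyr is installed (dplyr re-exports `%>%` from magrittr); the names `via_native` and `via_magrittr` are illustrative only.

```r
library(dplyr)  # also makes %>% available

# Same pipeline, written once with each pipe
via_native   <- mtcars |> filter(mpg > 20) |> head(3)
via_magrittr <- mtcars %>% filter(mpg > 20) %>% head(3)

# Compare the two results
identical(via_native, via_magrittr)
```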
filter() — keep rows
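
A minimal sketch of `filter()` on the built-in `mtcars` data, assuming dplyr is installed. The thresholds and the result names `efficient` and `efficient_or_light` are illustrative choices, not taken from the original page.

```r
library(dplyr)

# Conditions separated by commas are combined with AND
efficient <- mtcars |>
  filter(mpg > 25, cyl == 4)

# Use | inside a single condition for OR
efficient_or_light <- mtcars |>
  filter(mpg > 25 | wt < 2)

efficient
```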
Notice we wrote mpg, not mtcars$mpg. Inside the dplyr verbs,
column names refer to columns of the data frame you piped in.
That's a major usability win.
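filter() keeps only rows where the condition is TRUE, so rows
where it evaluates to NA are dropped silently. That is usually
what you want, and it differs from base R bracket subsetting,
which returns a row of NAs for each NA condition.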
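
The NA behaviour can be seen on a tiny data frame. This is a sketch under stated assumptions: `scores` is hypothetical toy data made up for illustration, and dplyr is assumed to be installed.

```r
library(dplyr)

# Hypothetical toy data with one missing score
scores <- data.frame(
  name  = c("a", "b", "c"),
  score = c(10, NA, 30)
)

# NA > 5 evaluates to NA, so row "b" is dropped
kept <- scores |> filter(score > 5)

# Ask for the missing rows explicitly when you need to keep them
kept_with_na <- scores |> filter(score > 5 | is.na(score))
```

If the missing rows matter to your analysis, the second form makes that choice visible in the code.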
select() — keep columns
select() has many "helper" functions: starts_with(),
ends_with(), contains(), matches(), everything(),
where(). They make column selection in wide datasets very
ergonomic.
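
A short sketch of `select()` and a few of its helpers, assuming dplyr is installed. The built-in `mtcars` and `iris` datasets are used, and the result names are illustrative.

```r
library(dplyr)

# Keep columns by name
three_cols <- mtcars |> select(mpg, wt, hp)

# Drop columns with a minus sign
no_gears <- mtcars |> select(-gear, -carb)

# Helper: columns whose names start with "d" (disp and drat in mtcars)
d_cols <- mtcars |> select(starts_with("d"))

# Reorder: move hp to the front, keep everything else after it
hp_first <- mtcars |> select(hp, everything())

# where() selects by a predicate applied to each column
numeric_cols <- iris |> select(where(is.numeric))
```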
mutate() — add or change columns
mutate() is for adding derived columns. The new columns can
reference any existing column — and any new column defined
earlier in the same mutate() call.
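
For instance, a sketch that derives one column and immediately reuses it, assuming dplyr is installed. The column names `wt_kg` and `hp_per_kg` are made up for illustration; `wt` in `mtcars` is recorded in units of 1000 lbs.

```r
library(dplyr)

cars_metric <- mtcars |>
  mutate(
    wt_kg     = wt * 1000 * 0.4536,  # wt is in 1000 lbs; convert to kg
    hp_per_kg = hp / wt_kg           # uses wt_kg defined on the line above
  )

head(cars_metric)
```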
To replace a column, mutate with the same name:
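
A minimal sketch of overwriting a column, assuming dplyr is installed. Converting `cyl` to a factor is an illustrative choice, as is the name `cars_factor`.

```r
library(dplyr)

# Reusing the name cyl replaces the column instead of adding a new one
cars_factor <- mtcars |>
  mutate(cyl = as.factor(cyl))

class(cars_factor$cyl)
```

The original `mtcars` is untouched; only the result assigned to `cars_factor` carries the changed column.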
arrange() — sort rows
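
A short sketch of `arrange()`, assuming dplyr is installed. The choice of columns and the result names are illustrative.

```r
library(dplyr)

# Ascending order is the default
lightest_first <- mtcars |> arrange(wt)

# desc() reverses the order for that column
most_efficient_first <- mtcars |> arrange(desc(mpg))

# Several columns: sort by cyl, then break ties by descending mpg
by_cyl_then_mpg <- mtcars |> arrange(cyl, desc(mpg))
```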
summarise() — collapse to a single row
summarise() (or summarize() — both spellings work) takes a
data frame and returns a single row containing summary
statistics:
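
A minimal sketch, assuming dplyr is installed. The summary column names `n`, `mean_mpg`, and `max_hp` are illustrative.

```r
library(dplyr)

overall <- mtcars |>
  summarise(
    n        = n(),        # number of rows
    mean_mpg = mean(mpg),
    max_hp   = max(hp)
  )

overall
```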
By itself, this isn't much more interesting than calling mean()
and similar functions directly. The magic happens when you
combine it with group_by(), which is the entire next page.
Putting it together: a small pipeline
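
One way such a pipeline might look, assuming dplyr is installed. The column name `power_to_weight`, the exact set of selected columns, and the result name `top_power` are assumptions made for this sketch.

```r
library(dplyr)

top_power <- mtcars |>
  filter(cyl %in% c(4, 6)) |>                  # four- and six-cylinder cars
  mutate(power_to_weight = hp / wt) |>         # derived ratio
  select(mpg, cyl, hp, wt, power_to_weight) |> # five columns
  arrange(desc(power_to_weight)) |>            # highest ratio first
  head(10)                                     # top 10 rows

top_power
```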
Read it top to bottom: "From mtcars, keep four- and six-cylinder cars, compute a power-to-weight ratio, select these five columns, sort by power-to-weight descending, and show me the top 10." Five verbs, one sentence.
Test your understanding
Which dplyr verb would you use to keep only rows where age >= 18?
- select()
- filter()
- arrange()
- mutate()

What does the pipe operator |> do?
- It runs two commands in parallel.
- It passes the value on the left as the first argument to the function on the right, letting you chain operations in reading order.
- It compares two values for equality.
- It declares a function.

Which verb would you use to add a new column bmi = weight / height^2?
- select()
- summarise()
- filter()
- mutate()
Mini challenge: top efficient cars
Using dplyr and the pipe, from mtcars:
- filter to cars with wt < 3 (lightweight)
- add a column mpg_per_cyl = mpg / cyl
- select mpg, cyl, wt, mpg_per_cyl
- arrange by mpg_per_cyl descending
- assign the result to top_eff
Use the pipe and four dplyr verbs (filter, mutate, select, arrange) to produce top_eff as described.
summarise() becomes spectacularly useful when paired with
group_by(). That's the next page.