The dplyr Verbs

So far we've been doing data manipulation in base R: bracket notation, $ access, order(), and so on. It works, but it gets verbose fast. Compare:

# base R
sub <- mtcars[mtcars$mpg > 20 & mtcars$cyl == 4,
              c("mpg", "wt", "hp")]
sub <- sub[order(-sub$mpg), ]

vs.

# dplyr
library(dplyr)

sub <- mtcars |>
  filter(mpg > 20, cyl == 4) |>
  select(mpg, wt, hp) |>
  arrange(desc(mpg))

The dplyr version reads like instructions: "take mtcars, filter to fuel-efficient four-cylinders, select these three columns, sort by mpg descending."

dplyr is built on five "verbs," each of which does exactly one thing.

The five core verbs

| Verb | What it does |
| --- | --- |
| `filter()` | keep rows that match a condition |
| `select()` | keep, drop, or reorder columns |
| `mutate()` | add or change columns |
| `arrange()` | sort rows |
| `summarise()` | collapse many rows into one summary row |

Plus one helper: group_by(), which makes summarise(), mutate(), and friends operate per group. We'll meet that properly on the next page.

You almost never use all five at once. But every dplyr pipeline is some composition of these.

The pipe: `|>`

The pipe takes the thing on its left and passes it as the first argument to the function on its right.

mtcars |> head()       # same as head(mtcars)
mtcars |> head(3)      # same as head(mtcars, 3)

This sounds trivial. It is not. It lets you write a sequence of transformations top-to-bottom, in reading order, instead of nesting parentheses inside out:

# Without the pipe (read inside out)
head(arrange(filter(mtcars, mpg > 20), desc(mpg)))

# With the pipe (read top to bottom)
mtcars |>
  filter(mpg > 20) |>
  arrange(desc(mpg)) |>
  head()

The second version reads like a sentence of English instructions, which is exactly the point.

R has two pipes: the native |> (added in R 4.1) and the older %>% from the magrittr package (used widely with dplyr). They're nearly equivalent for our purposes; we'll use |> throughout.
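
As a quick check of that "nearly equivalent" claim, here is a minimal sketch that runs the same chain with both pipes. It assumes dplyr is installed; dplyr re-exports %>% from magrittr, so no extra library() call is needed.

```r
library(dplyr)  # also makes %>% available (re-exported from magrittr)

# Same chain, written once with each pipe
native   <- mtcars |> filter(mpg > 20) |> nrow()
magrittr <- mtcars %>% filter(mpg > 20) %>% nrow()

# Both pipes pass the left-hand side as the first argument,
# so the two results match
identical(native, magrittr)
```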

`filter()` — keep rows
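
A minimal sketch of filter() on mtcars, assuming dplyr is installed. The result names (efficient_fours, small_engines) are made up for illustration.

```r
library(dplyr)

# Conditions separated by commas are combined with AND
efficient_fours <- mtcars |>
  filter(mpg > 20, cyl == 4)

# %in% keeps rows whose value matches any element of the vector
small_engines <- mtcars |>
  filter(cyl %in% c(4, 6))
```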

Notice we wrote mpg, not mtcars$mpg. Inside the dplyr verbs, column names refer to columns of the data frame you piped in. That's a major usability win.

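filter() keeps only the rows where the condition is TRUE, so rows where it evaluates to NA are dropped without a warning. That is usually what you want, and it differs from base R bracket subsetting, which returns a row of NAs for each NA in the condition. To keep those rows, ask for them explicitly, for example with is.na().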
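
The NA behaviour is easiest to see on a tiny example. The sketch below uses a hypothetical scores data frame (not part of mtcars) with one missing value, and compares base R bracket subsetting with filter().

```r
library(dplyr)

# Hypothetical toy data with one missing score
scores <- data.frame(id = 1:4, score = c(90, NA, 55, 72))

# Base R bracket subsetting returns an all-NA row where the condition is NA
base_result <- scores[scores$score > 60, ]

# filter() keeps only rows where the condition is TRUE, so that row is dropped
dplyr_result <- scores |> filter(score > 60)

# To keep rows with a missing score, ask for them explicitly
with_na <- scores |> filter(score > 60 | is.na(score))
```

If the missing rows matter for your analysis, the explicit is.na() form makes that choice visible in the code.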

`select()` — keep columns

select() has many "helper" functions: starts_with(), ends_with(), contains(), matches(), everything(), where(). They make column selection in wide datasets very ergonomic.
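
A short sketch of a few of these on mtcars, assuming dplyr is installed. The result names are made up for illustration.

```r
library(dplyr)

# Keep columns by name
three_cols <- mtcars |> select(mpg, wt, hp)

# Drop columns with a minus sign
no_vs_am <- mtcars |> select(-vs, -am)

# Helper: columns whose names start with "d" (disp and drat in mtcars)
d_cols <- mtcars |> select(starts_with("d"))

# Reorder: move wt to the front, keep the rest in their original order
reordered <- mtcars |> select(wt, everything())

# where() selects by a property of the column rather than its name
numeric_cols <- mtcars |> select(where(is.numeric))
```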

`mutate()` — add or change columns

mutate() is for adding derived columns. The new columns can reference any existing column — and any new column defined earlier in the same mutate() call.
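
For instance, the sketch below adds two columns to mtcars, where the second uses the first. The column names wt_kg and hp_per_tonne are made up for illustration, and 0.4536 is a rounded pounds-to-kilograms factor.

```r
library(dplyr)

cars_kg <- mtcars |>
  mutate(
    # wt is recorded in units of 1000 lbs; convert to kilograms (approximate)
    wt_kg = wt * 1000 * 0.4536,
    # wt_kg was defined one line above, in the same mutate() call
    hp_per_tonne = hp / (wt_kg / 1000)
  )
```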

To replace a column, mutate with the same name:
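
One possible illustration, assuming dplyr is installed: mtcars codes transmission as 0/1 in am, and the sketch below overwrites that column with a labelled factor. The column count stays the same because am is replaced, not added.

```r
library(dplyr)

# am is 0 (automatic) or 1 (manual) in mtcars; reuse the name to overwrite it
cars_labelled <- mtcars |>
  mutate(am = factor(am, levels = c(0, 1), labels = c("automatic", "manual")))
```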

`arrange()` — sort rows
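
A minimal sketch of arrange() on mtcars, assuming dplyr is installed. The result names are made up for illustration.

```r
library(dplyr)

# Ascending order is the default
lightest_first <- mtcars |> arrange(wt)

# desc() reverses the order for that column
most_efficient_first <- mtcars |> arrange(desc(mpg))

# Several columns: sort by cyl first, then by descending mpg within each cyl
by_cyl_then_mpg <- mtcars |> arrange(cyl, desc(mpg))
```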

`summarise()` — collapse to a single row

summarise() (or summarize() — both spellings work) takes a data frame and returns a single row containing summary statistics:
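
A minimal sketch on mtcars, assuming dplyr is installed. The summary column names (n_cars, mean_mpg, max_hp) are made up for illustration.

```r
library(dplyr)

overall <- mtcars |>
  summarise(
    n_cars   = n(),        # number of rows
    mean_mpg = mean(mpg),  # average fuel efficiency
    max_hp   = max(hp)     # highest horsepower
  )

# One row, three columns: one per summary
overall
```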

By itself, this isn't more interesting than mean() etc. on their own. The magic happens when you combine it with group_by() — which is the entire next page.

Putting it together: a small pipeline
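
One way such a pipeline might look, as a sketch. It assumes dplyr is installed; the column name power_to_weight, the choice of which five columns to keep, and the result name top_power are assumptions made for illustration.

```r
library(dplyr)

top_power <- mtcars |>
  filter(cyl %in% c(4, 6)) |>                     # four- and six-cylinder cars
  mutate(power_to_weight = hp / wt) |>            # hp per 1000 lbs (assumed definition)
  select(mpg, cyl, hp, wt, power_to_weight) |>    # assumed choice of five columns
  arrange(desc(power_to_weight)) |>               # highest ratio first
  head(10)                                        # top 10 rows
```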

Read it top to bottom: "From mtcars, keep four- and six-cylinder cars, compute a power-to-weight ratio, select these five columns, sort by power-to-weight descending, and show me the top 10." Five verbs, one sentence.

Test your understanding

Question (select one)

Which dplyr verb would you use to keep only rows where age >= 18?

select()

filter()

arrange()

mutate()

Question (select one)

What does the pipe operator |> do?

It runs two commands in parallel.

It passes the value on the left as the first argument to the function on the right, letting you chain operations in reading order.

It compares two values for equality.

It declares a function.

Question (select one)

Which verb would you use to add a new column bmi = weight / height^2?

select()

summarise()

filter()

mutate()

Mini challenge: top efficient cars

Using dplyr and the pipe, from mtcars:

filter to cars with wt < 3 (lightweight)
add a column mpg_per_cyl = mpg / cyl
select mpg, cyl, wt, mpg_per_cyl
arrange by mpg_per_cyl descending
assign the result to top_eff

Use the pipe and four dplyr verbs (filter, mutate, select, arrange) to produce top_eff as described.

summarise() becomes spectacularly useful when paired with group_by(). That's the next page.
