The dplyr Verbs

Five small verbs (`filter`, `select`, `mutate`, `arrange`, `summarise`) plus the pipe operator. With these, you can express almost any tabular data manipulation as code that reads close to plain English.

So far we've been doing data manipulation in base R: bracket notation, $ access, order(), and so on. It works, but it gets verbose fast. Compare:

# base R
sub <- mtcars[mtcars$mpg > 20 & mtcars$cyl == 4,
              c("mpg", "wt", "hp")]
sub <- sub[order(-sub$mpg), ]

vs.

# dplyr (the native pipe |> needs R 4.1 or later)
library(dplyr)

sub <- mtcars |>
  filter(mpg > 20, cyl == 4) |>
  select(mpg, wt, hp) |>
  arrange(desc(mpg))

The dplyr version reads like instructions: "take mtcars, filter to fuel-efficient four-cylinders, select these three columns, sort by mpg descending."

dplyr is built on five "verbs," each of which does exactly one thing.

The five core verbs

Verb         What it does
filter()     keep rows that match a condition
select()     keep, drop, or reorder columns
mutate()     add or change columns
arrange()    sort rows
summarise()  collapse many rows into one summary row

Plus one helper: group_by(), which makes summarise(), mutate(), and friends operate per group. We'll meet that properly on the next page.

You almost never use all five at once. But every dplyr pipeline is some composition of these.

The pipe: |>

The pipe takes the thing on its left and passes it as the first argument to the function on its right.

mtcars |> head()       # same as head(mtcars)
mtcars |> head(3)      # same as head(mtcars, 3)

This sounds trivial. It is not. It lets you write a sequence of transformations top-to-bottom, in reading order, instead of nesting parentheses inside out:

# Without the pipe (read inside out)
head(arrange(filter(mtcars, mpg > 20), desc(mpg)))

# With the pipe (read top to bottom)
mtcars |>
  filter(mpg > 20) |>
  arrange(desc(mpg)) |>
  head()

The second version reads like an English instruction, which is exactly the point.

R has two pipes: the native |> (added in R 4.1) and the older %>% from the magrittr package (used widely with dplyr). They're nearly equivalent for our purposes; we'll use |> throughout.
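
To check that the nested call and the two pipe styles give the same result, here is a minimal sketch. It assumes dplyr is installed (dplyr re-exports %>% from magrittr) and R is 4.1 or later; the object names nested, native, and magrittr_style are only illustrative.

```r
library(dplyr)

# Nested call: read inside out
nested <- head(arrange(filter(mtcars, mpg > 20), desc(mpg)))

# Native pipe: read top to bottom
native <- mtcars |>
  filter(mpg > 20) |>
  arrange(desc(mpg)) |>
  head()

# magrittr pipe, re-exported by dplyr
magrittr_style <- mtcars %>%
  filter(mpg > 20) %>%
  arrange(desc(mpg)) %>%
  head()

identical(nested, native)          # TRUE
identical(native, magrittr_style)  # TRUE
```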

filter() — keep rows
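
A minimal sketch of filter() on the built-in mtcars data, assuming dplyr is installed. The thresholds and the names efficient_manuals and four_or_six are illustrative choices, not taken from the page.

```r
library(dplyr)

# Keep cars that get more than 25 miles per gallon
mtcars |>
  filter(mpg > 25)

# Conditions separated by commas are combined with AND
efficient_manuals <- mtcars |>
  filter(mpg > 25, am == 1)

# %in% keeps rows whose value matches any element of a vector
four_or_six <- mtcars |>
  filter(cyl %in% c(4, 6))
```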


Notice we wrote mpg, not mtcars$mpg. Inside the dplyr verbs, column names refer to columns of the data frame you piped in. That's a major usability win.

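filter() keeps only the rows where the condition is TRUE, so rows where the condition evaluates to NA are dropped silently. That is usually what you want, but it can hide missing data; add an explicit is.na() test if you need to keep those rows.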
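
The NA behaviour is easiest to see on a tiny example. The sketch below uses a made-up data frame (scores is hypothetical, not part of the page) and compares filter() with base R bracket subsetting.

```r
library(dplyr)

# Hypothetical data with one missing score
scores <- data.frame(
  name  = c("a", "b", "c"),
  score = c(90, NA, 40)
)

# NA > 50 evaluates to NA, so row "b" is dropped along with row "c"
passed <- scores |>
  filter(score > 50)

# Base R bracket subsetting returns an extra all-NA row for the same condition
base_passed <- scores[scores$score > 50, ]

# To keep rows with a missing score, ask for them explicitly
passed_or_missing <- scores |>
  filter(score > 50 | is.na(score))
```

Here passed has one row, base_passed has two (one of them entirely NA), and passed_or_missing keeps rows "a" and "b".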

select() — keep columns
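
A minimal sketch of the common select() patterns on mtcars, assuming dplyr is installed. The result names (no_shape, first_four, renamed) are illustrative.

```r
library(dplyr)

# Keep three columns, in the order given
mtcars |> select(mpg, wt, hp)

# Drop columns with a minus sign
no_shape <- mtcars |> select(-vs, -am)

# Keep a contiguous range of columns
first_four <- mtcars |> select(mpg:hp)

# Rename while selecting: new_name = old_name
renamed <- mtcars |> select(miles_per_gallon = mpg, weight = wt)
```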


select() has many "helper" functions: starts_with(), ends_with(), contains(), matches(), everything(), where(). They make column selection in wide datasets very ergonomic.
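
The helpers are easier to appreciate on a dataset with patterned column names. As a sketch, the example below switches to the built-in iris data (an assumption made for illustration; the page itself works with mtcars).

```r
library(dplyr)

# iris columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species
sepal_cols   <- iris |> select(starts_with("Sepal"))
width_cols   <- iris |> select(ends_with("Width"))
numeric_cols <- iris |> select(where(is.numeric))

# matches() takes a regular expression
length_cols <- iris |> select(matches("Length$"))

# everything() is handy for moving a column to the front
species_first <- iris |> select(Species, everything())
```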

mutate() — add or change columns
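
A minimal sketch of mutate() on mtcars, assuming dplyr is installed. The column names wt_lb and hp_per_lb are made up for this example.

```r
library(dplyr)

with_ratios <- mtcars |>
  mutate(
    wt_lb     = wt * 1000,   # wt is recorded in thousands of pounds
    hp_per_lb = hp / wt_lb   # refers to wt_lb, created one line earlier
  ) |>
  select(hp, wt, wt_lb, hp_per_lb)

head(with_ratios)
```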


mutate() is for adding derived columns. The new columns can reference any existing column — and any new column defined earlier in the same mutate() call.

To replace a column, mutate with the same name:
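
As a small sketch, the snippet below overwrites cyl with a factor version of itself. The name cars_factor is illustrative, and mtcars itself is left unchanged because the result is assigned to a new object.

```r
library(dplyr)

# Reusing an existing column name replaces that column in the result
cars_factor <- mtcars |>
  mutate(cyl = factor(cyl))

ncol(cars_factor) == ncol(mtcars)  # TRUE: no column was added
class(cars_factor$cyl)             # "factor"
class(mtcars$cyl)                  # "numeric": the original is untouched
```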


arrange() — sort rows
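
A minimal sketch of arrange() on mtcars, assuming dplyr is installed. The result names are illustrative.

```r
library(dplyr)

# Ascending by default: lowest mpg first
mtcars |> arrange(mpg)

# desc() reverses the order: highest mpg first
by_mpg_desc <- mtcars |> arrange(desc(mpg))

# Several columns: sort by cyl, then break ties by descending mpg
by_cyl_then_mpg <- mtcars |> arrange(cyl, desc(mpg))
```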


summarise() — collapse to a single row

summarise() (or summarize() — both spellings work) takes a data frame and returns a single row containing summary statistics:
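
A minimal sketch, assuming dplyr is installed. The choice of statistics and the names n, mean_mpg, and max_hp are illustrative.

```r
library(dplyr)

mpg_summary <- mtcars |>
  summarise(
    n        = n(),        # number of rows
    mean_mpg = mean(mpg),  # average fuel economy
    max_hp   = max(hp)     # highest horsepower
  )

nrow(mpg_summary)  # 1: the whole data frame collapses to a single row
```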


On its own, summarise() does little more than calling mean() and similar functions directly. It becomes far more useful when combined with group_by(), which is the subject of the next page.

Putting it together: a small pipeline
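
One way such a pipeline might look, as a sketch. The specific columns kept and the name hp_per_wt for the power-to-weight ratio are assumptions made for this example.

```r
library(dplyr)

top_power <- mtcars |>
  filter(cyl %in% c(4, 6)) |>              # four- and six-cylinder cars only
  mutate(hp_per_wt = hp / wt) |>           # power-to-weight ratio
  select(mpg, cyl, hp, wt, hp_per_wt) |>   # keep five columns
  arrange(desc(hp_per_wt)) |>              # strongest ratio first
  head(10)                                 # top 10 rows

top_power
```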


Read it top to bottom: "From mtcars, keep four- and six-cylinder cars, compute a power-to-weight ratio, select these five columns, sort by power-to-weight descending, and show me the top 10." Five steps, one sentence.

Test your understanding

Question (select one)

Which dplyr verb would you use to keep only rows where age >= 18?

  • select()
  • filter()
  • arrange()
  • mutate()

Question (select one)

What does the pipe operator |> do?

  • It runs two commands in parallel.
  • It passes the value on the left as the first argument to the function on the right, letting you chain operations in reading order.
  • It compares two values for equality.
  • It declares a function.

Question (select one)

Which verb would you use to add a new column bmi = weight / height^2?

  • select()
  • summarise()
  • filter()
  • mutate()

Mini challenge: top efficient cars

Using dplyr and the pipe, from mtcars:

  • filter to cars with wt < 3 (lightweight)
  • add a column mpg_per_cyl = mpg / cyl
  • select mpg, cyl, wt, mpg_per_cyl
  • arrange by mpg_per_cyl descending
  • assign the result to top_eff

Use the pipe and four dplyr verbs (filter, mutate, select, arrange) to produce top_eff as described.
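
For comparison after you have tried it yourself, here is one possible solution. It is a sketch that follows the steps listed above, assuming dplyr is installed; other orderings of the verbs can produce the same result.

```r
library(dplyr)

top_eff <- mtcars |>
  filter(wt < 3) |>                        # lightweight cars
  mutate(mpg_per_cyl = mpg / cyl) |>       # efficiency per cylinder
  select(mpg, cyl, wt, mpg_per_cyl) |>     # keep four columns
  arrange(desc(mpg_per_cyl))               # most efficient first

head(top_eff)
```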

summarise() becomes spectacularly useful when paired with group_by(). That's the next page.
