
Subsetting and Filtering

How to ask a dataset the question "show me only the rows I care about, and only the columns I need" — the everyday operation of data analysis.

The single most common operation in data analysis is some flavor of: keep some of the rows, keep some of the columns, ignore the rest.

R has more than one way to do this. On this page we'll learn the base R way — using [ , ] — because it builds directly on what you already know about vectors. On the next pages we'll meet dplyr, which gives a more readable syntax for the same ideas.

The mental model: [rows, cols]

The bracket on a data frame takes two arguments separated by a comma: which rows, which columns. An empty slot means all.
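As a minimal sketch of the two-slot idea, the snippet below uses the built-in mtcars data set (an assumption; any data frame would do) and fills or leaves empty each slot in turn. The variable names are purely illustrative.

```r
# mtcars ships with base R (datasets package), so nothing needs to be loaded.
first_rows_two_cols <- mtcars[1:3, c("mpg", "cyl")]  # rows 1 to 3, two named columns
first_rows_all_cols <- mtcars[1:3, ]                 # empty column slot: all columns
all_rows_two_cols   <- mtcars[, c("mpg", "cyl")]     # empty row slot: all rows

# Compare the shapes (rows, columns) of the three results.
dim(first_rows_two_cols)
dim(first_rows_all_cols)
dim(all_rows_two_cols)
```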

Selecting columns

Selecting a single column, as in df[, "mpg"], returns a plain vector rather than a data frame, a surprise that trips beginners up. If you always want a data frame back, either select multiple columns or pass drop = FALSE.
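The column-selection behavior described here can be sketched as follows, again assuming the built-in mtcars data set; the variable names are illustrative only.

```r
# Single column by name: the result is dropped to a plain numeric vector.
one_col_vec <- mtcars[, "mpg"]

# Same column with drop = FALSE: the result stays a one-column data frame.
one_col_df <- mtcars[, "mpg", drop = FALSE]

# Several columns by name, or by position: always a data frame.
two_cols    <- mtcars[, c("mpg", "wt")]
by_position <- mtcars[, 1:2]

is.data.frame(one_col_vec)
is.data.frame(one_col_df)
```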

Selecting rows

Rows can be selected by position, by row name, or with a logical vector. The last form, a logical vector with one element per row, is the workhorse. Every "filter" you'll ever do is some variation of it.
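A hedged sketch of three common row-selection forms, assuming mtcars and illustrative variable names. The exact forms in the original example are not known; position, row name, and logical vector are the usual trio.

```r
# 1. By position: the first five rows.
by_position <- mtcars[1:5, ]

# 2. By row name: mtcars uses car models as row names.
by_name <- mtcars[c("Mazda RX4", "Fiat 128"), ]

# 3. By logical vector: one TRUE/FALSE per row, TRUE rows are kept.
keep     <- mtcars$cyl == 4
four_cyl <- mtcars[keep, ]

# The logical vector has exactly one element per row of the data frame.
length(keep) == nrow(mtcars)
```

Note the trailing comma in each call: without it, mtcars[keep] would be interpreted as column selection.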

Selecting rows AND columns at once

You can do both in one expression, for example mtcars[mtcars$mpg > 25, c("mpg", "cyl", "wt")].

Read it out loud: "from mtcars, take the rows where mpg > 25, and from those, just the mpg, cyl, and wt columns." The bracket notation maps cleanly onto the sentence.
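To make the "rows, then columns" reading concrete, here is a small sketch (variable names are illustrative) that performs the selection in one step and again in two steps, so the results can be compared.

```r
# One expression: filter rows and pick columns together.
efficient <- mtcars[mtcars$mpg > 25, c("mpg", "cyl", "wt")]

# The same thing spelled out as two steps.
rows_first <- mtcars[mtcars$mpg > 25, ]
two_step   <- rows_first[, c("mpg", "cyl", "wt")]

# Both routes should give the same data frame.
all.equal(efficient, two_step)
```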

A common pitfall: == vs %in%

If you want to match against a set of values, don't chain many == comparisons together with |. Use %in% instead.

%in% is one of R's nicest small operators. It's vectorized: "for each element of the left side, is it found anywhere in the right side?"
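The comparison below is a minimal sketch of the two styles on mtcars; the wanted vector and the other names are illustrative, not from the original page.

```r
wanted <- c(4, 6)

# Chained comparisons: correct, but grows with every extra value.
with_or <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]

# Set membership: one test, however many values are in `wanted`.
with_in <- mtcars[mtcars$cyl %in% wanted, ]

identical(with_or, with_in)

# %in% answers "is each left element found anywhere on the right?"
c("a", "b", "z") %in% c("a", "b", "c")
# → TRUE TRUE FALSE
```

Writing mtcars$cyl == c(4, 6) is not a substitute: == compares element by element and recycles the shorter vector, so it is not a membership test.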

A common pitfall: missing values in the condition

If your filter condition involves a column with NAs, the comparison returns NA for those rows: NA > 5 is NA, not FALSE, and indexing with NA yields a row of all NAs.


The fix is to always guard with !is.na(col) when filtering on a column that might contain missing values. (dplyr::filter(), which we'll meet soon, does this for you: it drops rows where the condition is NA.)
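A small sketch of the problem and the guard, using the built-in airquality data set, whose Ozone column contains missing values. The variable names are illustrative.

```r
# Unguarded: every NA in Ozone produces an all-NA row in the result.
unguarded <- airquality[airquality$Ozone > 100, ]

# Guarded: rows with a missing Ozone are excluded before the comparison counts.
guarded <- airquality[!is.na(airquality$Ozone) & airquality$Ozone > 100, ]

nrow(unguarded)           # includes the spurious all-NA rows
nrow(guarded)             # only rows that truly satisfy Ozone > 100
anyNA(unguarded$Ozone)    # check whether NA rows leaked into the result
```

The difference between the two row counts should equal the number of missing Ozone values.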

Sorting

Sorting is a sibling of subsetting. You sort a data frame by reordering its rows according to one or more columns. order() is the helper.

order() returns the positions that would sort the vector; you then use those positions as row indices.
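The following sketch shows order() on a tiny vector and then on mtcars; the variable names are illustrative, and the descending and multi-column variants are common idioms rather than anything taken from the original example.

```r
# order() gives positions, not sorted values.
order(c(30, 10, 20))
# → 2 3 1  (the smallest value is at position 2, then 3, then 1)

# Use those positions as row indices to sort the data frame.
idx    <- order(mtcars$mpg)
by_mpg <- mtcars[idx, ]

# Descending order.
by_mpg_desc <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]

# Several keys: cyl ascending, then mpg descending within each cyl group.
by_cyl_then_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]

head(by_mpg[, c("mpg", "cyl")], 3)
```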

Test your understanding

Question (select one)

What is the convention for indexing a data frame with brackets?

df[cols, rows]

df[rows, cols]

df[rows][cols]

df.rows.cols

Question (select one)

Which expression filters iris to rows where the species is either "setosa" or "versicolor"?

iris[iris$Species == "setosa" == "versicolor", ]

iris[iris$Species %in% c("setosa", "versicolor"), ]

iris[iris$Species = "setosa" | "versicolor", ]

iris[in(iris$Species, "setosa", "versicolor"), ]

Question (select one)

Why is airquality[airquality$Ozone > 100, ] potentially dangerous?

It's not — it works perfectly.

Ozone contains NAs, and NA > 100 evaluates to NA, which produces rows of all NAs in the result.

It sorts the data instead of filtering it.

Ozone is character, so the comparison errors.

Mini challenge: heavy cars with poor fuel economy

From mtcars, build heavy_inefficient: a data frame containing only the rows where wt > 4 and mpg < 18, and only the columns mpg, cyl, and wt.

Challenge: Filter and select

Use base R subsetting to filter mtcars to cars with wt > 4 and mpg < 18, keeping only the mpg, cyl, and wt columns. Assign the result to heavy_inefficient.
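One possible solution, offered as a sketch rather than the only correct answer: combine the two conditions with & in the row slot and name the three columns in the column slot.

```r
# Rows: heavy (wt > 4) AND inefficient (mpg < 18). Columns: mpg, cyl, wt.
heavy_inefficient <- mtcars[mtcars$wt > 4 & mtcars$mpg < 18,
                            c("mpg", "cyl", "wt")]

heavy_inefficient
```

mtcars has no missing values, so no !is.na() guard is needed here; with other data it would be a sensible addition.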

We've now seen the bracket way to wrangle data. It is powerful but verbose. The next big topic is tidy data — a principle for how to shape your data so wrangling becomes easy.
