Subsetting and Filtering
How to ask a dataset the question "show me only the rows I care about, and only the columns I need" — the everyday operation of data analysis.
The single most common operation in data analysis is some flavor of: keep some of the rows, keep some of the columns, ignore the rest.
R has more than one way to do this. On this page we'll learn the
base R way — using [ , ] — because it builds directly on
what you already know about vectors. On the next pages we'll meet
dplyr, which gives a more readable syntax for the same ideas.
The mental model: [rows, cols]
The bracket on a data frame takes two arguments separated by a comma: which rows, which columns. An empty slot means all.
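A minimal sketch of the [rows, cols] pattern, using the built-in mtcars data frame. The particular rows and columns chosen here are only illustrative:

```r
# mtcars ships with base R (datasets package), so nothing needs to be loaded.

first_rows <- mtcars[1:3, ]                 # rows 1 to 3, empty column slot = all columns
two_cols   <- mtcars[, c("mpg", "wt")]      # empty row slot = all rows, two columns
corner     <- mtcars[1:3, c("mpg", "wt")]   # rows 1 to 3 AND two columns

dim(first_rows)  # 3 11
dim(two_cols)    # 32 2
dim(corner)      # 3 2
```

The comma is what separates the two slots; leaving one side of it empty keeps everything along that dimension.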
Selecting columns
Selecting a single column, as in df[, "mpg"], returns a vector rather
than a data frame, a surprise that trips beginners up.
If you want to always get back a data frame, either select
multiple columns or pass drop = FALSE.
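The column-selection behaviour described above can be sketched as follows. The variable names are made up for illustration, and mtcars stands in for whatever data frame you are working with:

```r
# Several columns, by name or by position: the result is a data frame.
by_name <- mtcars[, c("mpg", "cyl")]
by_pos  <- mtcars[, 1:2]              # mpg and cyl are the first two columns

# A single column: the result silently collapses to a plain vector.
one_vec <- mtcars[, "mpg"]
is.data.frame(one_vec)   # FALSE

# drop = FALSE keeps the data frame structure even for one column.
one_df <- mtcars[, "mpg", drop = FALSE]
is.data.frame(one_df)    # TRUE
```

Passing drop = FALSE is mostly useful inside functions, where the number of selected columns is not known in advance.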
Selecting rows
The logical-vector form, which passes a logical vector with one element per row, is the workhorse. Nearly every "filter" you write is some variation of it.
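As a hedged illustration, here are three ways to pick rows from mtcars: by position, by row name, and by logical vector. The first two forms are assumptions about what a typical example would show; the logical-vector form is the one described above:

```r
# 1. By position: the first five rows.
by_position <- mtcars[1:5, ]

# 2. By row name: mtcars uses car models as row names.
by_rowname <- mtcars[c("Mazda RX4", "Datsun 710"), ]

# 3. By logical vector: one TRUE/FALSE per row, TRUE rows are kept.
is_efficient <- mtcars$mpg > 25
length(is_efficient)        # 32, one entry per row of mtcars
efficient <- mtcars[is_efficient, ]
nrow(efficient)             # 6
```

Building the logical vector in a separate step, as is_efficient does here, makes it easy to inspect the condition before using it as an index.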
Selecting rows AND columns at once
You can do both in one expression, for example mtcars[mtcars$mpg > 25, c("mpg", "cyl", "wt")].
Read it out loud: "from mtcars, take the rows where mpg > 25, and from those, just the mpg, cyl, and wt columns." The bracket notation maps cleanly onto the sentence.
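The sentence above, written out as runnable code. The result name efficient_small is a hypothetical choice:

```r
# Rows where mpg > 25, and only the mpg, cyl, and wt columns.
efficient_small <- mtcars[mtcars$mpg > 25, c("mpg", "cyl", "wt")]

dim(efficient_small)     # 6 3
names(efficient_small)   # "mpg" "cyl" "wt"
```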
A common pitfall: == vs %in%
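If you want to match against a set of values, don't chain many
== comparisons with |, and don't compare against a whole vector
with == (that recycles element by element instead of testing
membership). Use %in% instead.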
%in% is one of R's nicest small operators. It's vectorized:
"for each element of the left side, is it found anywhere in the
right side?"
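A small sketch contrasting the chained == version with %in%, using the cyl column of mtcars as an assumed example:

```r
# %in% asks, element by element: is this value anywhere in the set?
c("a", "z") %in% c("a", "b", "c")   # TRUE FALSE

# Verbose: one == per value, glued together with |.
verbose <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]

# Concise: a single membership test against the whole set.
concise <- mtcars[mtcars$cyl %in% c(4, 6), ]

identical(verbose, concise)   # TRUE
```

The two filters return the same rows; the %in% version simply stays readable as the set of values grows.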
A common pitfall: missing values in the condition
If your filter condition involves a column with NAs, those rows
do not simply drop out: NA > 5 is NA, not FALSE, and a logical
index of NA yields a row of all NAs.
The fix is to guard with !is.na(col) whenever you filter on a
column that might contain missing values. (dplyr::filter(), which
we'll meet soon, does this for you by dropping rows where the
condition is NA.)
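To make the NA pitfall concrete, here is a sketch on a tiny hand-built data frame. The readings data and its column names are invented for illustration:

```r
# Hypothetical data: two of the five ozone readings are missing.
readings <- data.frame(id = 1:5, ozone = c(41, NA, 120, NA, 135))

readings$ozone > 100
# FALSE NA TRUE NA TRUE

# Naive filter: each NA in the condition produces a row of all NAs.
naive <- readings[readings$ozone > 100, ]
nrow(naive)            # 4 (two real rows plus two all-NA rows)

# Guarded filter: !is.na() turns the NA cases into FALSE.
guarded <- readings[!is.na(readings$ozone) & readings$ozone > 100, ]
nrow(guarded)          # 2

# Alternative: which() keeps only the positions that are TRUE, dropping NAs.
via_which <- readings[which(readings$ozone > 100), ]
```

Both the guarded version and the which() version keep only the rows with ids 3 and 5.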
Sorting
Sorting is a sibling of subsetting. You sort a data frame by
ordering its rows according to one or more columns. The helper is
order(): it returns the positions that would sort a vector, and
you then use those positions as row indices.
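A short sketch of sorting with order(), again on mtcars. The choice of sort columns is only an example:

```r
# order() returns positions, not sorted values.
order(c(30, 10, 20))   # 2 3 1

# Ascending by mpg: use the positions as row indices.
by_mpg <- mtcars[order(mtcars$mpg), ]

# Descending by mpg.
by_mpg_desc <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]

# Several keys: cyl ascending, then mpg descending within each cyl group.
# Negating a numeric column reverses its sort direction.
by_cyl_then_mpg <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]

head(by_mpg[, c("mpg", "cyl")], 3)
```

Because order() only produces row positions, the column slot stays free, so sorting and column selection can be combined in the same bracket.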
Test your understanding
What is the convention for indexing a data frame with brackets?
df[cols, rows]
df[rows, cols]
df[rows][cols]
df.rows.cols
Which expression filters iris to rows where the species is either "setosa" or "versicolor"?
iris[iris$Species == "setosa" == "versicolor", ]
iris[iris$Species %in% c("setosa", "versicolor"), ]
iris[iris$Species = "setosa" | "versicolor", ]
iris[in(iris$Species, "setosa", "versicolor"), ]
Why is airquality[airquality$Ozone > 100, ] potentially dangerous?
It's not — it works perfectly.
Ozone contains NAs, and NA > 100 evaluates to NA, which produces rows of all NAs in the result.
It sorts the data instead of filtering it.
Ozone is character, so the comparison errors.
Mini challenge: heavy cars with poor fuel economy
From mtcars, use base R subsetting to build heavy_inefficient: a
data frame containing only the rows where wt > 4 and mpg < 18,
and only the columns mpg, cyl, and wt.
We've now seen the bracket way to wrangle data. It is powerful but verbose. The next big topic is tidy data — a principle for how to shape your data so wrangling becomes easy.