Practical R for Beginners

Subsetting and Filtering

How to ask a dataset the question "show me only the rows I care about, and only the columns I need" — the everyday operation of data analysis.

The single most common operation in data analysis is some flavor of: keep some of the rows, keep some of the columns, ignore the rest.

R has more than one way to do this. On this page we'll learn the base R way — using [ , ] — because it builds directly on what you already know about vectors. On the next pages we'll meet dplyr, which gives a more readable syntax for the same ideas.

The mental model: `[rows, cols]`

The bracket on a data frame takes two arguments separated by a comma: which rows, which columns. An empty slot means all.
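
As a quick illustration of the two slots, here is a minimal sketch using R's built-in mtcars dataset (32 rows, 11 columns). The object names are chosen here for illustration only.

```r
# mtcars ships with R: 32 rows (cars) and 11 columns (measurements).
first_rows <- mtcars[1:3, ]     # rows 1-3; empty column slot means all columns
first_cols <- mtcars[, 1:2]     # empty row slot means all rows; columns 1-2
corner     <- mtcars[1:3, 1:2]  # rows 1-3 and columns 1-2

dim(first_rows)  # 3 11
dim(first_cols)  # 32 2
dim(corner)      # 3 2
```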

Selecting columns

The "single column returns a vector" surprise trips beginners up. If you want to always get back a data frame, either select multiple columns or pass drop = FALSE.
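
The behavior described above can be sketched as follows, again assuming the built-in mtcars dataset. The variable names are illustrative.

```r
mpg_vec  <- mtcars[, "mpg"]                # one column: simplified to a numeric vector
two_cols <- mtcars[, c("mpg", "cyl")]      # two columns: still a data frame
mpg_df   <- mtcars[, "mpg", drop = FALSE]  # one column, kept as a data frame

is.data.frame(mpg_vec)   # FALSE
is.data.frame(two_cols)  # TRUE
is.data.frame(mpg_df)    # TRUE
```

Passing drop = FALSE is the safer habit inside functions, where the number of selected columns may not be known in advance.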

Selecting rows

The third form — passing a logical vector of the same length as the number of rows — is the workhorse. Every "filter" you'll ever do is some variation of it.
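
Here is a minimal sketch of three ways to pick rows, assuming the three forms are row positions, row names, and a logical vector, and using the built-in mtcars dataset for illustration.

```r
by_position <- mtcars[c(1, 3, 5), ]                  # 1. row positions
by_name     <- mtcars[c("Mazda RX4", "Fiat 128"), ]  # 2. row names
by_logical  <- mtcars[mtcars$cyl == 4, ]             # 3. logical vector, one value per row

nrow(by_position)  # 3
nrow(by_name)      # 2
nrow(by_logical)   # 11
```

In the third form, mtcars$cyl == 4 produces one TRUE or FALSE per row, and the bracket keeps only the TRUE rows.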

Selecting rows AND columns at once

You can do both in one expression:
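
For instance, the sketch below filters rows and selects columns of the built-in mtcars dataset in a single bracket. The name efficient is chosen here for illustration.

```r
# Rows where mpg > 25; columns mpg, cyl, and wt only.
efficient <- mtcars[mtcars$mpg > 25, c("mpg", "cyl", "wt")]

# Positions work in both slots too: the first five rows, first two columns.
top_left <- mtcars[1:5, 1:2]

names(efficient)  # "mpg" "cyl" "wt"
```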

Read these out loud: "from mtcars, take the rows where mpg > 25, and from those, just the mpg, cyl, and wt columns." The bracket notation maps cleanly onto the sentence.

A common pitfall: `==` vs `%in%`

If you want to match against a set of values, don't try to combine many == with |. Use %in%:
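
A minimal sketch of the difference, assuming we want the 4- and 6-cylinder cars from the built-in mtcars dataset. The last part shows a related trap with == that the heading hints at; the variable names are illustrative.

```r
# Verbose: one == per value, glued together with |
chained <- mtcars[mtcars$cyl == 4 | mtcars$cyl == 6, ]

# Clearer: one %in% against the whole set
matched <- mtcars[mtcars$cyl %in% c(4, 6), ]

identical(chained, matched)  # TRUE

# Tempting but wrong: == recycles c(4, 6) down the column (4, 6, 4, 6, ...)
# and compares position by position, so matching rows are silently dropped.
recycled <- mtcars[mtcars$cyl == c(4, 6), ]
nrow(recycled) < nrow(matched)  # TRUE
```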

%in% is one of R's nicest small operators. It's vectorized: "for each element of the left side, is it found anywhere in the right side?"

A common pitfall: missing values in the condition

If your filter condition involves a column with NAs, the comparison returns NA for those rows instead of FALSE (NA > 5 is NA, not FALSE), and indexing a data frame with an NA row index yields a row of all NAs.

The fix is to guard with !is.na(col) whenever you filter on a column that might contain missing values. (dplyr::filter(), which we'll meet soon, drops rows where the condition is NA for you.)
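
The problem and the guard can be sketched with the built-in airquality dataset, whose Ozone column contains missing values. The names unsafe and safe are illustrative.

```r
# Unguarded: every NA in Ozone becomes a row of all NAs in the result.
unsafe <- airquality[airquality$Ozone > 100, ]

# Guarded: FALSE & NA is FALSE, so rows with missing Ozone are dropped.
safe <- airquality[!is.na(airquality$Ozone) & airquality$Ozone > 100, ]

anyNA(unsafe$Ozone)  # TRUE
anyNA(safe$Ozone)    # FALSE

# The extra rows in `unsafe` are exactly the rows where Ozone was NA.
nrow(unsafe) - nrow(safe) == sum(is.na(airquality$Ozone))  # TRUE
```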

Sorting

Sorting is a sibling of subsetting. You sort a data frame by ordering its rows according to one or more columns. order() is the helper:
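
A minimal sketch of sorting with order(), assuming the built-in mtcars dataset. The variable names are chosen for illustration.

```r
# order() returns positions, not sorted values.
order(c(30, 10, 20))  # 2 3 1

by_mpg      <- mtcars[order(mtcars$mpg), ]                     # ascending mpg
by_mpg_desc <- mtcars[order(mtcars$mpg, decreasing = TRUE), ]  # descending mpg
by_cyl_mpg  <- mtcars[order(mtcars$cyl, -mtcars$mpg), ]        # cyl ascending, then mpg descending

head(by_mpg[, c("mpg", "cyl")], 3)  # the three least fuel-efficient cars
```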

order() returns the positions that would sort the vector; you then use those positions as row indices.

Test your understanding

Question (select one)

What is the convention for indexing a data frame with brackets?

df[cols, rows]

df[rows, cols]

df[rows][cols]

df.rows.cols

Question (select one)

Which expression filters iris to rows where the species is either "setosa" or "versicolor"?

iris[iris$Species == "setosa" == "versicolor", ]

iris[iris$Species %in% c("setosa", "versicolor"), ]

iris[iris$Species = "setosa" | "versicolor", ]

iris[in(iris$Species, "setosa", "versicolor"), ]

Question (select one)

Why is airquality[airquality$Ozone > 100, ] potentially dangerous?

It's not — it works perfectly.

Ozone contains NAs, and NA > 100 evaluates to NA, which produces rows of all NAs in the result.

It sorts the data instead of filtering it.

Ozone is character, so the comparison errors.

Mini challenge: heavy cars with poor fuel economy

From mtcars, use base R subsetting to build heavy_inefficient: a data frame containing only the rows where wt > 4 and mpg < 18, and only the columns mpg, cyl, and wt.

We've now seen the bracket way to wrangle data. It is powerful but verbose. The next big topic is tidy data — a principle for how to shape your data so wrangling becomes easy.
