Practical R for Beginners

Grouped Analysis

One of the most powerful ideas in data analysis: split your data into groups, compute something per group, and combine the results. `group_by()` paired with `summarise()` turns that into a short pipeline.

Almost every interesting analytical question is a "per-group" question:

Average sales per region.
Median salary per department.
Maximum temperature per month.
Win rate per team.
Conversion rate per campaign.

The pattern is universal and has a name: split-apply-combine.

In dplyr, you do it in two lines: group_by() + summarise().

The basic pattern
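
A minimal sketch of the pattern, assuming the built-in mtcars dataset and the dplyr package; the result name avg_by_cyl and the column name avg_mpg are illustrative choices:

```r
library(dplyr)

# Split mtcars by cylinder count, compute the mean mpg in each group,
# and combine the per-group results into one small table.
avg_by_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg))

avg_by_cyl
```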

Read it: "Group the data by cyl. Within each group, compute the mean of mpg. Combine."

Without group_by(), summarise() returns one row total. With group_by(cyl), it returns one row per cyl value. That tiny change is enormous.
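
The difference in row counts can be checked directly. This is a small sketch on mtcars; the names overall and per_cyl are chosen here for illustration:

```r
library(dplyr)

# No grouping: the whole table collapses to a single row.
overall <- mtcars |>
  summarise(avg_mpg = mean(mpg))

# Grouped by cyl: one row per distinct cyl value (4, 6, 8).
per_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg))

nrow(overall)  # 1
nrow(per_cyl)  # 3
```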

Multiple summaries at once

You can compute many summary columns in a single summarise() call:
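
A sketch of what such a call might look like on mtcars; the column names n_cars, avg_mpg, and max_hp are illustrative, not fixed by dplyr:

```r
library(dplyr)

# Several summary columns computed per cylinder group in one summarise() call.
cyl_summary <- mtcars |>
  group_by(cyl) |>
  summarise(
    n_cars  = n(),        # rows in the current group
    avg_mpg = mean(mpg),  # mean fuel economy
    max_hp  = max(hp)     # highest horsepower
  )

cyl_summary
```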

n() is a special dplyr helper meaning "how many rows in this group." You will use it constantly.

Multiple grouping variables

Group by more than one column to get all combinations:
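
As a sketch, grouping mtcars by both cyl and am (transmission type) gives one row per combination that occurs in the data. The choice of am as the second variable and the output column names are assumptions made for this example:

```r
library(dplyr)

# One row for every (cyl, am) pair that actually appears in mtcars.
by_cyl_am <- mtcars |>
  group_by(cyl, am) |>
  summarise(
    avg_mpg = mean(mpg),
    n_cars  = n(),
    .groups = "drop"  # return an ungrouped table
  )

by_cyl_am
```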

The .groups = "drop" argument tells summarise() to return an ungrouped result. Without it, summarise() removes only the last grouping variable, so the result stays grouped by the remaining ones, and dplyr prints a message saying so. Being explicit avoids surprises in later steps.
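
One way to see what is left of the grouping is dplyr's group_vars(). The sketch below assumes the same cyl and am grouping on mtcars; the object names are illustrative:

```r
library(dplyr)

# Default behaviour: only the last grouping variable (am) is dropped.
still_grouped <- mtcars |>
  group_by(cyl, am) |>
  summarise(avg_mpg = mean(mpg))

# Explicit drop: no grouping remains on the result.
fully_dropped <- mtcars |>
  group_by(cyl, am) |>
  summarise(avg_mpg = mean(mpg), .groups = "drop")

group_vars(still_grouped)  # "cyl"
group_vars(fully_dropped)  # character(0)
```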

Counting groups: `count()`

Counting how many rows are in each group is so common that there's a shortcut: count(). These two are equivalent:
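
A sketch of the two forms side by side on mtcars; the object names by_hand and shortcut are illustrative:

```r
library(dplyr)

# Long form: group, then count rows per group with n().
by_hand <- mtcars |>
  group_by(cyl) |>
  summarise(n = n())

# Shortcut: count() does the grouping and counting in one step.
shortcut <- mtcars |>
  count(cyl)

by_hand
shortcut
```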

count() is your go-to for the question "what categories exist, and how many rows of each?"

Grouped `mutate()`: per-group calculations that keep the rows

group_by() doesn't only work with summarise(). With mutate(), it computes the new column within each group, but keeps every row.
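
A minimal sketch on mtcars, with column names avg_mpg_cyl and mpg_vs_avg chosen for this example: each car keeps its row and gains its group's mean alongside its own value.

```r
library(dplyr)

cars_rel <- mtcars |>
  group_by(cyl) |>
  mutate(
    avg_mpg_cyl = mean(mpg),          # group mean, repeated on every row of the group
    mpg_vs_avg  = mpg - avg_mpg_cyl   # how far each car is from its group mean
  ) |>
  ungroup()

nrow(cars_rel)  # 32, same as mtcars
```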

This pattern — "compute a per-group statistic, then express each row relative to it" — is extremely common. It's how you normalize, rank within groups, compute z-scores per category, etc.
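
Per-group z-scores and within-group ranks can be sketched the same way. The example below assumes mtcars grouped by cyl and uses dplyr's min_rank() and desc(); the new column names are illustrative:

```r
library(dplyr)

scored <- mtcars |>
  group_by(cyl) |>
  mutate(
    mpg_z    = (mpg - mean(mpg)) / sd(mpg),  # z-score relative to the car's own cyl group
    mpg_rank = min_rank(desc(mpg))           # 1 = highest mpg within the group
  ) |>
  ungroup()

scored |>
  select(cyl, mpg, mpg_z, mpg_rank) |>
  head()
```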

ungroup() removes the grouping that group_by() added. Once the grouped work is done, ungroup so that later verbs operate on the whole table rather than silently working per group.
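
The effect of a leftover grouping shows up as soon as another verb runs. In this sketch (object names are illustrative) the same mutate() call gives per-group means on the grouped table and a single overall mean after ungroup():

```r
library(dplyr)

grouped <- mtcars |> group_by(cyl)

# Grouping still active: mean(mpg) is computed within each cyl group.
per_group <- grouped |>
  mutate(m = mean(mpg))

# Grouping removed first: mean(mpg) is computed over all 32 rows.
whole_table <- grouped |>
  ungroup() |>
  mutate(m = mean(mpg))

n_distinct(per_group$m)    # 3
n_distinct(whole_table$m)  # 1
```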

Slicing: top N per group

A frequent question: "show me the top N items per group." The verb is slice_max() (and friends).
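
A sketch of "top 2 cars by mpg per cylinder group" on mtcars. Choosing n = 2 and with_ties = FALSE is an assumption for this example; with the default with_ties = TRUE, a group can return more than n rows when values tie.

```r
library(dplyr)

top2 <- mtcars |>
  select(cyl, mpg, hp) |>
  group_by(cyl) |>
  slice_max(mpg, n = 2, with_ties = FALSE) |>  # n must be passed by name
  ungroup()

top2
```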

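slice_max(col, n = 3) keeps the 3 rows with the largest values of col in each group; n must be passed by name. Ties can return extra rows unless you set with_ties = FALSE. Siblings: slice_min(), slice_head(), slice_tail(), slice_sample().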
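
The siblings follow the same shape. A brief sketch on mtcars grouped by cyl, with illustrative object names:

```r
library(dplyr)

by_cyl <- mtcars |>
  select(cyl, mpg, hp) |>
  group_by(cyl)

# Lowest-mpg car in each group (exactly one row per group, ties broken arbitrarily).
lowest_mpg <- by_cyl |>
  slice_min(mpg, n = 1, with_ties = FALSE) |>
  ungroup()

# First two rows of each group, in their existing order.
first_two <- by_cyl |>
  slice_head(n = 2) |>
  ungroup()

# One randomly chosen row per group.
random_one <- by_cyl |>
  slice_sample(n = 1) |>
  ungroup()
```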

A realistic example: iris by species

iris has 150 flowers, 50 of each of three species. Per-species summary statistics are the natural question:
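
One possible version of such a summary, assuming the built-in iris dataset; which measurements to summarise and the output column names are choices made for this sketch:

```r
library(dplyr)

iris_summary <- iris |>
  group_by(Species) |>
  summarise(
    n                = n(),                 # flowers per species
    avg_sepal_length = mean(Sepal.Length),
    avg_petal_length = mean(Petal.Length),
    sd_petal_length  = sd(Petal.Length)
  )

iris_summary
```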

One row per species, all summaries on the same line.

Multi-file challenge: monthly air quality

Let's apply this to airquality. We'll structure the analysis across two files: one helper file with reusable functions, and one main script that computes a monthly summary.

Challenge

Monthly summary of NYC air quality

Complete main.R so it produces a data frame monthly with one row per month, columns:

Month
avg_ozone (mean of Ozone, NAs removed)
avg_temp (mean of Temp)
n_obs (number of rows in that month)

You can use the helper mean_na() from utils.R.
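
One way main.R might look, as a sketch only. The contents of utils.R are not shown on this page, so mean_na() is assumed here to be a thin wrapper around mean(x, na.rm = TRUE) and is defined inline to keep the example self-contained:

```r
library(dplyr)

# Stand-in for the helper in utils.R (assumed definition, not the original file).
# In the two-file layout, main.R would load it with source("utils.R") instead.
mean_na <- function(x) mean(x, na.rm = TRUE)

monthly <- airquality |>
  group_by(Month) |>
  summarise(
    avg_ozone = mean_na(Ozone),  # Ozone contains NAs, so they must be removed
    avg_temp  = mean(Temp),      # Temp has no missing values in airquality
    n_obs     = n()              # rows recorded in that month
  )

monthly
```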

Test your understanding

Question (select one)

What's the universal pattern that group_by() |> summarise() implements?

Filter-arrange-join

Split-apply-combine

Map-reduce-filter

Sort-merge-deduplicate

Question (select one)

In a dplyr pipeline, what does n() do?

Returns the number of columns.

Returns the number of rows in the current (possibly grouped) data.

Returns the mean of the data.

Returns the row name.

Question (select one)

What's the difference between group_by(cyl) |> summarise(avg = mean(mpg)) and group_by(cyl) |> mutate(avg = mean(mpg))?

They're the same.

summarise is slower than mutate.

summarise returns one row per group (collapses); mutate keeps every row and broadcasts the group statistic into each row.

mutate cannot be used with group_by.

The last page in this section covers reshaping — switching between wide and long forms when your data isn't yet tidy.
