Practical R for Beginners

Summary Statistics

Mean, median, standard deviation, quantiles — the small set of numbers that lets you describe an entire column in a single sentence.

When you can't look at every value in a column (and you usually can't — even 100 numbers is too many to mentally hold), you need to summarize it. A handful of summary statistics, chosen well, can carry an enormous amount of information.

The two questions every summary answers

Every summary statistic answers one of two questions:

Where is the center? (location: mean, median)
How spread out are the values? (dispersion: sd, IQR, range)

Together, these two ideas describe a column with surprising faithfulness.

Measures of center

The mean and the median tell different stories. Both are "averages" in everyday language, but:

Mean is the sum divided by the count. It uses every value equally and is sensitive to outliers.
Median is the middle value when sorted. Half the data is below, half is above. It is robust to outliers.
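To make the contrast concrete, here is a minimal sketch using ten made-up incomes. The values are an assumption, chosen only so that the mean and median match the figures quoted in this section; they are not real data.

```r
# Ten hypothetical annual incomes in dollars (illustrative, not real data).
# One very high earner pulls the mean up without moving the median.
incomes <- c(32, 38, 41, 45, 48, 51, 55, 60, 72, 238) * 1000

mean(incomes)    # 68000: sum divided by count, dragged up by the 238k value
median(incomes)  # 49500: midpoint of the two middle values (48k and 51k)

# Drop the top earner to see how sensitive each statistic is.
mean(incomes[-10])    # falls sharply
median(incomes[-10])  # barely moves
```

Removing a single value changes the mean by almost $19k but the median by only $1.5k, which is what "robust to outliers" means in practice.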

In our income example, the mean ($68k) is misleading — it makes the typical person sound wealthier than they are. The median ($49.5k) is closer to what most people earn.

Rule of thumb: if mean and median differ noticeably, your distribution is skewed, and you should report the median (or both).

Measures of spread

Two distributions with identical means can have wildly different shapes. Spread answers the question "how tightly is the data concentrated around its center?"

Standard deviation (sd): roughly, the typical distance from the mean.
Variance: the square of the standard deviation. Same idea, different scale.
Range: max minus min. Simple but extremely sensitive to outliers.
IQR: the spread of the middle 50% of values. Robust to outliers.
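The four measures above map onto base R functions. The sketch below uses two small hypothetical vectors (the names tight and wide are illustrative) that share a mean of 50 but differ in spread.

```r
# Two hypothetical columns with the same mean (50) but different spread.
tight <- c(48, 49, 50, 51, 52)
wide  <- c(10, 30, 50, 70, 90)

mean(tight); mean(wide)   # both 50

sd(tight);  sd(wide)      # standard deviation: about 1.58 vs about 31.6
var(tight); var(wide)     # variance: 2.5 vs 1000 (the sd squared)

# range() returns c(min, max); diff() turns that into max minus min.
diff(range(tight)); diff(range(wide))   # 4 vs 80

IQR(tight); IQR(wide)     # spread of the middle 50%: 2 vs 40
```

Note that range() on its own returns the minimum and maximum, not their difference, so diff() is needed to get a single number.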

Quantiles: a richer picture

Quantiles split sorted data into equal-sized chunks. The quartiles (25%, 50%, 75%) and quintiles (every 20%) are the most common.

The classic "five-number summary" (min, Q1, median, Q3, max) is what summary() and the boxplot are built around. We'll see boxplots in the visualization section.
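Quartiles, quintiles and the five-number summary can each be computed in one call. The vector x below is a small hypothetical example, not data from this course.

```r
# A small hypothetical vector, already sorted for readability.
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 15)

# Quartiles: the 25th, 50th and 75th percentiles.
quantile(x, probs = c(0.25, 0.50, 0.75))

# Quintile cut points: every 20%, from the minimum (0%) to the maximum (100%).
quantile(x, probs = seq(0, 1, by = 0.2))

# Five-number summary: min, lower hinge, median, upper hinge, max.
fivenum(x)
```

fivenum() uses Tukey's hinges, while quantile() interpolates by default, so the two can disagree slightly on Q1 and Q3 for some vector lengths. For this vector they agree.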

Putting it all together: `summary()`

R's built-in summary() for a numeric vector gives you the five-number summary plus the mean, for free:

On a data frame, it does this column by column. It's the fastest way to size up a dataset.
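A minimal sketch of both calls. The vector x is hypothetical, and using the built-in iris data frame here is an assumption made for illustration.

```r
x <- c(2, 4, 4, 5, 7, 8, 9, 11, 15)  # hypothetical numeric vector

# On a numeric vector: Min., 1st Qu., Median, Mean, 3rd Qu., Max.
summary(x)

# On a data frame: one block per column. Numeric columns get the six
# numbers above; factor columns such as Species get counts per level.
summary(iris)
```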

Summary statistics by group

Combine dplyr with the summary statistics we know:
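A sketch of what such a grouped summary might look like. It assumes the dplyr package is installed and uses the built-in iris data with the Petal.Length column; the column choice and the name petal_stats are assumptions for illustration.

```r
library(dplyr)

# Assumption: the built-in iris data and its Petal.Length column.
petal_stats <- iris %>%
  group_by(Species) %>%
  summarise(
    n          = n(),
    mean_len   = mean(Petal.Length),
    median_len = median(Petal.Length),
    sd_len     = sd(Petal.Length),
    iqr_len    = IQR(Petal.Length)
  )

petal_stats  # one row per species
```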

In one table, you can see that setosa petals are much shorter and tighter than the others, while virginica petals are longer and more variable.

A subtlety: which summary for which question?

Sometimes the "right" summary depends on what you're actually trying to communicate.

"What does a typical day look like?" → median
"What is the total resource cost?" → mean × count (or sum() directly)
"What's the worst day we should plan for?" → maybe max, or the 95th percentile
"How variable is performance?" → sd or IQR

Every summary is a compression of the data. You're throwing information away. Choose the summary that throws away the information you don't need.
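The question-to-summary pairings above can be sketched on a single column. The daily_cost vector is made up for illustration, with one unusually expensive day at the end.

```r
# Hypothetical daily compute cost in dollars for two weeks (made-up numbers).
daily_cost <- c(90, 95, 100, 102, 98, 97, 105, 99, 101, 96, 94, 103, 100, 400)

median(daily_cost)                     # a typical day
sum(daily_cost)                        # total resource cost
mean(daily_cost) * length(daily_cost)  # same total, via mean times count
quantile(daily_cost, probs = 0.95)     # a bad-but-plausible day to plan for
max(daily_cost)                        # the single worst day observed
sd(daily_cost); IQR(daily_cost)        # variability: sd vs the robust IQR
```

Each line answers a different question about the same fourteen numbers, and each one discards a different part of the data.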

When summary statistics lie

Anscombe's quartet is a famous demonstration that the same summary statistics can describe wildly different datasets:
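The quartet ships with base R as the anscombe data frame (columns x1 to x4 and y1 to y4). A minimal sketch that computes the mean and standard deviation of every column; the names means and sds are illustrative:

```r
# anscombe is a built-in data frame with columns x1..x4 and y1..y4.
means <- sapply(anscombe, mean)
sds   <- sapply(anscombe, sd)

# One row of means and one row of standard deviations, rounded for reading.
round(rbind(mean = means, sd = sds), 2)
```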

All four datasets have nearly identical means and standard deviations. But if you plot them (try it!), they look completely different: one is roughly linear, one is curved, one is a tight line thrown off by a single outlier, and one has the same x value for every point except one.

The moral: summary statistics are not a substitute for looking at your data. They are a complement. Always combine them with a plot.

Test your understanding

Question (select one)

Which measure of center is robust to outliers?

mean

variance

median

range

Question (select one)

Two columns both have mean 50, but one has sd 2 and the other has sd 20. The one with sd 20:

has more values

has higher values

has values that are more spread out around 50

has fewer NAs

Question (select one)

Anscombe's quartet is famous because it shows:

That R has poor numerical precision.

Four very different datasets can share nearly identical summary statistics — proving you must visualize, not just summarize.

That correlation always implies causation.

That outliers should be removed.

Mini challenge: describe a column

Write a function describe() that takes a numeric vector and returns a named numeric vector of (n, mean, median, sd, min, max, iqr) — ignoring NAs throughout.

Build a describe() helper

Implement describe(x) so that it returns a named numeric vector with the listed statistics. NAs should be ignored.
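If you get stuck, here is one possible solution sketch, not the only valid one. It assumes that n means the number of non-missing values, which the challenge text does not spell out.

```r
# One possible describe(): every statistic is computed on the non-missing
# values only. Here n is the count of non-NA values (an assumption about
# what the challenge intends).
describe <- function(x) {
  stopifnot(is.numeric(x))
  x <- x[!is.na(x)]          # drop NAs once, up front
  c(
    n      = length(x),
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    min    = min(x),
    max    = max(x),
    iqr    = IQR(x)
  )
}

describe(c(1, 2, 3, 4, NA))
```

Dropping the NAs once at the top avoids repeating na.rm = TRUE in every call. An all-NA or empty input is not handled specially in this sketch.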

Summary statistics compress a column into a few numbers. The next page is about seeing the column's full shape — the world of distributions.
