Summary Statistics
Mean, median, standard deviation, quantiles — the small set of numbers that lets you describe an entire column in a single sentence.
When you can't look at every value in a column (and you usually can't — even 100 numbers is too many to mentally hold), you need to summarize it. A handful of summary statistics, chosen well, can carry an enormous amount of information.
The two questions every summary answers
Every summary statistic answers one of two questions:
- Where is the center? (location: mean, median)
- How spread out are the values? (dispersion: sd, IQR, range)
Together, these two ideas describe a column with surprising faithfulness.
Measures of center
The mean and the median tell different stories. Both are "averages" in everyday language, but:
- Mean is the sum divided by the count. It uses every value equally and is sensitive to outliers.
- Median is the middle value when sorted. Half the data is below, half is above. It is robust to outliers.
In our income example, the mean ($68k) is misleading — it makes the typical person sound wealthier than they are. The median ($49.5k) is closer to what most people earn.
Rule of thumb: if the mean and median differ noticeably, your distribution is probably skewed, and you should report the median (or both).
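To see how a single outlier moves the mean but not the median, here is a minimal sketch. The income vector is made up for illustration and is not the data behind the income example above.

```r
# Hypothetical incomes in thousands of dollars, with one very high earner.
incomes <- c(30, 35, 40, 45, 50, 55, 60, 400)

mean(incomes)    # pulled upward by the single 400
median(incomes)  # midpoint of the sorted values

# Swap the outlier for a more ordinary value: the mean drops sharply,
# while the median stays where it was.
incomes_trimmed <- replace(incomes, incomes == 400, 65)
mean(incomes_trimmed)
median(incomes_trimmed)
```

A large gap between the first two results is the signal described in the rule of thumb.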
Measures of spread
Two distributions with identical means can have wildly different shapes. Spread tells you "how concentrated is the data near its center?"
- Standard deviation (sd): roughly, the typical distance from the mean.
- Variance: the square of the standard deviation. Same idea, different scale.
- Range: max minus min. Simple but extremely sensitive to outliers.
- IQR (interquartile range): the width of the middle 50% of values, that is, the 75th percentile minus the 25th. Robust to outliers.
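The four measures above map onto base R functions. A small sketch, using two made-up vectors that share a mean but differ in spread:

```r
# Two hypothetical vectors with the same mean (50) but different spread.
tight <- c(48, 49, 50, 51, 52)
wide  <- c(10, 30, 50, 70, 90)

mean(tight); mean(wide)  # identical centers

sd(tight); sd(wide)      # standard deviation
var(wide)                # variance, equal to sd(wide)^2
diff(range(wide))        # range() returns c(min, max), so take the difference
IQR(wide)                # 75th percentile minus 25th percentile
```

Note that `range()` by itself returns the minimum and maximum rather than a single number, which is why `diff()` is wrapped around it here.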
Quantiles: a richer picture
Quantiles are the cut points that split sorted data into equal-sized chunks. The quartiles (25%, 50%, 75%) and quintiles (every 20%) are the most common.
The classic "five-number summary" (min, Q1, median, Q3, max) is what summary() and the boxplot are built around. We'll see boxplots in the visualization section.
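As a minimal sketch of the ideas above, `quantile()` returns cut points at whatever probabilities you ask for, and `fivenum()` returns a five-number summary directly. The vector is made up for illustration.

```r
x <- 1:9  # hypothetical data

quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))
quartiles

# Quintile cut points: every 20%
quantile(x, probs = seq(0.2, 0.8, by = 0.2))

# min, lower hinge, median, upper hinge, max
five <- fivenum(x)
five
```

`fivenum()` uses Tukey's hinges, which can differ slightly from the quartiles `quantile()` reports by default on other data, so small disagreements between the two are not a bug.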
Putting it all together: summary()
R's built-in summary() for a numeric vector gives you the five-number summary plus the mean, for free:
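A minimal sketch, assuming a small made-up vector `x` (the values are hypothetical):

```r
x <- c(2, 4, 4, 5, 7, 9, 30)  # hypothetical data

s <- summary(x)
s         # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
names(s)  # the labels of the six values
```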
On a data frame, it does this column by column. It's the fastest way to size up a dataset.
Summary statistics by group
Combine dplyr with the summary statistics we know:
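One way this might look, assuming the built-in iris data and the Petal.Length column. The choice of column and the output column names (`mean_petal` and so on) are assumptions made for this sketch.

```r
library(dplyr)

# One row per species, with center and spread of petal length.
by_species <- iris %>%
  group_by(Species) %>%
  summarise(
    n            = n(),
    mean_petal   = mean(Petal.Length),
    median_petal = median(Petal.Length),
    sd_petal     = sd(Petal.Length)
  )

by_species
```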
In one table, you can see that setosa petals are much shorter and tighter than the others, while virginica petals are longer and more variable.
A subtlety: which summary for which question?
Sometimes the "right" summary depends on what you're actually trying to communicate.
- "What does a typical day look like?" → median
- "What is the total resource cost?" → mean × count (or sum() directly)
- "What's the worst day we should plan for?" → maybe max, or the 95th percentile
- "How variable is performance?" → sd or IQR
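The questions above translate into one-liners. A sketch on a made-up vector of daily request counts (the numbers and the name `daily` are hypothetical):

```r
# Hypothetical daily request counts for two weeks, including one spike.
daily <- c(120, 135, 128, 140, 132, 90, 85, 125, 138, 131, 520, 129, 95, 88)

median(daily)                    # what a typical day looks like
sum(daily)                       # total cost; same as mean(daily) * length(daily)
max(daily)                       # the single worst day
quantile(daily, probs = 0.95)    # a high-but-not-extreme day to plan for
sd(daily)                        # variability, inflated by the spike
IQR(daily)                       # variability of the middle 50%, less affected by the spike
```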
Every summary is a compression of the data. You're throwing information away. Choose the summary that throws away the information you don't need.
When summary statistics lie
Anscombe's quartet is a famous demonstration that the same summary statistics can describe wildly different datasets:
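A minimal sketch using the `anscombe` data frame that ships with R (columns x1 to x4 and y1 to y4, one x/y pair per dataset):

```r
# Column-wise means and standard deviations for all four datasets.
means <- sapply(anscombe, mean)
sds   <- sapply(anscombe, sd)

round(means, 2)
round(sds, 2)
```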
All four datasets have nearly identical means and standard deviations. But if you plot them (try it!), they look completely different: one is roughly linear, one is curved, one has a single outlier pulling the line, and one has all of its x-variation coming from a single point.
The moral: summary statistics are not a substitute for looking at your data. They are a complement. Always combine them with a plot.
Test your understanding
Which measure of center is robust to outliers?
- mean
- variance
- median
- range
Two columns both have mean 50, but one has sd 2 and the other has sd 20. The one with sd 20:
- has more values
- has higher values
- has values that are more spread out around 50
- has fewer NAs
Anscombe's quartet is famous because it shows:
- That R has poor numerical precision.
- That four very different datasets can share nearly identical summary statistics, proving you must visualize, not just summarize.
- That correlation always implies causation.
- That outliers should be removed.
Mini challenge: describe a column
Write a function describe(x) that takes a numeric vector and returns a named numeric vector of (n, mean, median, sd, min, max, iqr), ignoring NAs throughout.
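One possible solution sketch (try it yourself first). It assumes that n should count only the non-missing values, which the challenge text does not spell out.

```r
describe <- function(x) {
  # Drop NAs once up front so every statistic sees the same values.
  x <- x[!is.na(x)]
  c(
    n      = length(x),  # count of non-missing values (an assumption)
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    min    = min(x),
    max    = max(x),
    iqr    = IQR(x)
  )
}

result <- describe(c(1, 2, 3, 4, NA))
result
```

Dropping NAs once at the top avoids repeating `na.rm = TRUE` in every call. An alternative would be to pass `na.rm = TRUE` to each function and compute n with `sum(!is.na(x))`.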
Summary statistics compress a column into a few numbers. The next page is about seeing the column's full shape — the world of distributions.