Summary Statistics

Mean, median, standard deviation, quantiles — the small set of numbers that lets you describe an entire column in a single sentence.

When you can't look at every value in a column (and you usually can't: even 100 numbers are too many to hold in your head), you need to summarize it. A handful of summary statistics, chosen well, can carry an enormous amount of information.

The two questions every summary answers

Every summary statistic answers one of two questions:

  1. Where is the center? (location: mean, median)
  2. How spread out are the values? (dispersion: sd, IQR, range)

Together, these two ideas describe a column with surprising faithfulness.

Measures of center

The mean and the median tell different stories. Both are "averages" in everyday language, but:

  • Mean is the sum divided by the count. It uses every value equally and is sensitive to outliers.
  • Median is the middle value when sorted (or the average of the two middle values when the count is even). Half the data is below, half is above. It is robust to outliers.

In our income example, the mean ($68k) is misleading — it makes the typical person sound wealthier than they are. The median ($49.5k) is closer to what most people earn.

Rule of thumb: if the mean and median differ noticeably, your distribution is probably skewed or contains outliers, and you should report the median (or both).
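The income figures quoted above can be reproduced with a small sketch. The vector below is made up for illustration (it is not the page's original data); its values are chosen only so that the mean and median come out to 68 and 49.5, in thousands of dollars.

```r
# Hypothetical annual incomes in thousands of dollars. These values are
# invented so that the mean and median match the figures quoted in the text.
income <- c(32, 38, 41, 45, 49, 50, 54, 58, 63, 250)

mean(income)    # 68: pulled upward by the single 250 value
median(income)  # 49.5: the average of the two middle values, 49 and 50

# Dropping the outlier moves the mean a lot and the median only a little.
mean(income[income < 250])
median(income[income < 250])
```

One large value is enough to pull the mean well above what nine of the ten people earn, while the median barely moves.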

Measures of spread

Two distributions with identical means can have wildly different shapes. Spread tells you "how concentrated is the data near its center?"

  • Standard deviation (sd): roughly, the typical distance from the mean.
  • Variance: the square of the standard deviation. Same idea, but measured in squared units.
  • Range: max minus min. Simple but extremely sensitive to outliers.
  • IQR: Q3 minus Q1, the width of the middle 50% of values. Robust to outliers.
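As a minimal sketch of the four measures listed above, the two vectors below (made-up values, named `tight` and `wide` purely for illustration) share a mean of 50 but differ sharply in spread. Note that base R's range() returns the minimum and maximum as a pair, so the "max minus min" range is computed here with diff(range(x)).

```r
# Two invented vectors with the same mean (50) but different spread.
tight <- c(48, 49, 50, 51, 52)
wide  <- c(10, 30, 50, 70, 90)

mean(tight); mean(wide)                 # both 50

sd(tight);  sd(wide)                    # standard deviation
var(tight); var(wide)                   # variance, equal to sd()^2
diff(range(tight)); diff(range(wide))   # range as max minus min: 4 and 80
IQR(tight); IQR(wide)                   # width of the middle 50%: 2 and 40
```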

Quantiles: a richer picture

Quantiles split sorted data into equal-sized chunks. The quartiles (25%, 50%, 75%) and quintiles (every 20%) are the most common.

The classic "five-number summary" (min, Q1, median, Q3, max) is what summary() and the boxplot are built around. We'll see boxplots in the visualization section.
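A small sketch of quantiles in base R, using the integers 1 to 101 as made-up data so the cut points are easy to verify by eye. quantile() takes the probabilities you want; fivenum() returns the five-number summary using Tukey's hinges, which can differ slightly from quantile()'s default quartiles on other sample sizes.

```r
# Invented data: the integers 1 to 101.
x <- 1:101

quantile(x, probs = c(0.25, 0.50, 0.75))      # quartiles: 26, 51, 76
quantile(x, probs = seq(0.2, 0.8, by = 0.2))  # quintile cut points
fivenum(x)                                    # min, lower hinge, median, upper hinge, max
```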

Putting it all together: summary()

R's built-in summary(), applied to a numeric vector, gives you the five-number summary plus the mean, for free.

On a data frame, it does this column by column. It's the fastest way to size up a dataset.
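A brief sketch of both uses, assuming R's built-in iris data frame as the example (the page's own example data is not shown here, so this choice is an assumption).

```r
# iris ships with R; Petal.Length is one of its numeric columns.
s <- summary(iris$Petal.Length)
s                # Min., 1st Qu., Median, Mean, 3rd Qu., Max.

# On a data frame, summary() reports each column separately.
# Numeric columns get the six numbers; the factor column Species gets counts.
summary(iris)
```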

Summary statistics by group

Combine dplyr's grouping with the summary statistics we already know to get one row of statistics per group.

In one table, you can see that setosa petals are much shorter and tighter than the others, while virginica petals are longer and more variable.
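The table described above could be produced along these lines. This is a sketch that assumes the data is R's built-in iris dataset and that petal length is the column of interest; the output column names (`mean_len`, `sd_len`, and so on) are chosen here for illustration.

```r
library(dplyr)

# One row per species, with center and spread of petal length.
petal_stats <- iris |>
  group_by(Species) |>
  summarise(
    n          = n(),
    mean_len   = mean(Petal.Length),
    median_len = median(Petal.Length),
    sd_len     = sd(Petal.Length),
    iqr_len    = IQR(Petal.Length)
  )

petal_stats
```

Reporting a center and a spread side by side for each group is what makes the comparison between species readable at a glance.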

A subtlety: which mean for which question?

Sometimes the "right" summary depends on what you're actually trying to communicate.

  • "What does a typical day look like?" → median
  • "What is the total resource cost?" → mean × count (or sum() directly)
  • "What's the worst day we should plan for?" → maybe max, or the 95th percentile
  • "How variable is performance?" → sd or IQR
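The pairings above can be sketched on a single vector. The daily cost figures below are hypothetical, made up only to show which function goes with which question.

```r
# Hypothetical daily compute cost in dollars for ten days (invented values).
daily_cost <- c(120, 135, 110, 125, 140, 130, 115, 128, 122, 480)

median(daily_cost)                  # a typical day
sum(daily_cost)                     # total cost; same as mean(daily_cost) * length(daily_cost)
quantile(daily_cost, probs = 0.95)  # a bad but plausible day to plan for
max(daily_cost)                     # the single worst day observed
sd(daily_cost); IQR(daily_cost)     # how variable the days are
```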

Every summary is a compression of the data. You're throwing information away. Choose the summary that throws away the information you don't need.

When summary statistics lie

Anscombe's quartet is a famous demonstration that the same summary statistics can describe wildly different datasets.

All four datasets have nearly identical means and standard deviations. But if you plot them (try it!), they look completely different: one is roughly linear, one is curved, one is linear except for a single outlier that tilts the fitted line, and one has the same x value everywhere except for a single point.
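A minimal way to check this yourself, using the anscombe data frame that ships with base R (columns x1 to x4 and y1 to y4). The plotting loop is just one simple layout choice, not the only way to draw the quartet.

```r
# Column-wise means and standard deviations of the built-in anscombe data.
round(sapply(anscombe, mean), 2)
round(sapply(anscombe, sd), 2)

# Plot the four x/y pairs in a 2-by-2 grid to see how different they are.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)  # restore the previous plotting layout
```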

The moral: summary statistics are not a substitute for looking at your data. They are a complement. Always combine them with a plot.

Test your understanding

Question (select one)

Which measure of center is robust to outliers?

  • mean
  • variance
  • median
  • range

Question (select one)

Two columns both have mean 50, but one has sd 2 and the other has sd 20. The one with sd 20:

  • has more values
  • has higher values
  • has values that are more spread out around 50
  • has fewer NAs

Question (select one)

Anscombe's quartet is famous because it shows:

  • That R has poor numerical precision.
  • Four very different datasets can share nearly identical summary statistics, proving you must visualize, not just summarize.
  • That correlation always implies causation.
  • That outliers should be removed.

Mini challenge: describe a column

Write a function describe() that takes a numeric vector and returns a named numeric vector of (n, mean, median, sd, min, max, iqr) — ignoring NAs throughout.

Build a describe() helper

Implement describe(x) so that it returns a named numeric vector with the listed statistics. NAs should be ignored.
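If you want something to compare your attempt against, here is one possible sketch, not the page's reference solution. It assumes that n should count only the non-missing values, which the challenge text does not spell out.

```r
# One possible describe(): drop NAs once, then compute each statistic.
# Assumption: n is the number of non-missing values.
describe <- function(x) {
  x <- x[!is.na(x)]
  c(
    n      = length(x),
    mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    min    = min(x),
    max    = max(x),
    iqr    = IQR(x)
  )
}

describe(c(2, 4, 4, NA, 10))
```

Removing the NAs up front avoids repeating na.rm = TRUE in every call and keeps n consistent with the other statistics.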

Summary statistics compress a column into a few numbers. The next page is about seeing the column's full shape — the world of distributions.