Measures of Spread

Range, variance, standard deviation, IQR, MAD, and the coefficient of variation: why the spread of the data matters as much as where its center sits.

Two delivery services both average a 30-minute wait. One always arrives between 28 and 32 minutes. The other swings from 5 minutes to an hour. Same center, wildly different experience — and if you only reported the average, you'd call them identical. A measure of center tells you where the data sits; a measure of spread tells you how much it moves. You almost never understand a column from its center alone.

Spread is where risk, reliability, and surprise live. A model's error, a process's consistency, a portfolio's volatility, a sensor's noise — all of these are spread, not center. This page builds the vocabulary: range, variance, standard deviation, the IQR, the MAD, and the coefficient of variation, plus the one divisor detail (n−1) that trips up almost everyone.

Same center, different spread

Let's make the delivery story concrete. Both services have a mean of 30, but everything that matters is in the spread.
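As a minimal sketch of the delivery story, the snippet below uses two made-up sets of wait times (the names `steady` and `erratic` and all the numbers are illustrative assumptions, not real data). Both are constructed to average exactly 30 minutes.

```python
import numpy as np

# Hypothetical wait times in minutes; both are invented to average exactly 30.
steady = np.array([28, 29, 30, 30, 31, 32])
erratic = np.array([5, 15, 25, 35, 40, 60])

for name, waits in [("steady", steady), ("erratic", erratic)]:
    print(
        f"{name:8s} mean={waits.mean():.1f}  "
        f"min={waits.min()}  max={waits.max()}  "
        f"sample SD={waits.std(ddof=1):.1f}"
    )
```

The means are identical, so the difference between the two services only shows up in the min, max, and standard deviation columns.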

Center and spread are a package deal

Reporting a mean without a spread is like giving a GPS location with no accuracy radius. "Around 30 minutes, give or take 1" and "around 30 minutes, give or take 25" are completely different claims. Always pair a center with a spread.

Range: the quick-and-dirty spread

The range is just max − min. It's the easiest spread to compute and explain, and it's genuinely useful as a first glance or a sanity check (a range of 0 means a constant column; a range of 10,000 on what should be ages means a data-entry bug).

But the range has a fatal flaw for serious use: it depends on only the two most extreme points, so a single outlier or typo can blow it up. It also tends to grow as you collect more data — more observations mean more chances to see an extreme — so it's not comparable across different sample sizes.

The range is fragile

Because it uses only two values, the range throws away everything in between and is maximally sensitive to outliers. Use it for a quick gut-check, never as your primary spread measure.
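To see that fragility in numbers, here is a small sketch with invented ages, where one hypothetical typo (42 entered as 420) is appended to an otherwise sensible column.

```python
import numpy as np

ages = np.array([23, 31, 35, 42, 47, 58])
ages_with_typo = np.append(ages, 420)  # hypothetical typo: 42 entered as 420

range_clean = ages.max() - ages.min()
range_typo = np.ptp(ages_with_typo)    # np.ptp is "peak to peak": max minus min

print(range_clean)  # 35
print(range_typo)   # 397
```

One bad value out of seven changes the range by more than a factor of ten, which is why it works as a sanity check but not as a summary.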

Variance and standard deviation: spread around the mean

Most spread measures ask the same question: on average, how far are the points from the center? The natural idea is to take each point's deviation from the mean (the value minus the mean) and average those deviations. But the raw deviations always sum to zero: positives and negatives cancel, which is what "balance point" means. To stop the cancellation, we square each deviation before averaging.

That average of squared deviations is the variance. It's a real, honest spread number, but it's in squared units — square dollars, square minutes — which nobody can interpret. So we take the square root and get the standard deviation, which is back in the original units and reads as a typical distance from the mean.

s = √( Σ(xᵢ − x̄)² / (n − 1) )

You will rarely compute this by hand — numpy and pandas do it — but the shape matters: square the distances, average them, square-root back.
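The recipe "square the distances, average them, square-root back" can be sketched step by step and checked against NumPy. The data values here are arbitrary, picked only so the arithmetic is easy to follow.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

deviations = x - x.mean()           # mean is 5.0
print(deviations.sum())             # raw deviations cancel to zero

sum_sq = (deviations ** 2).sum()    # sum of squared deviations: 32.0
sample_var = sum_sq / (len(x) - 1)  # divide by n - 1, as in the formula above
sample_sd = np.sqrt(sample_var)     # back to the original units

print(sample_var, sample_sd)
print(np.var(x, ddof=1), np.std(x, ddof=1))  # the library results should agree
```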

How to read a standard deviation

The standard deviation is in the same units as your data and means roughly "the typical amount a value differs from the mean." If daily sales average 100 with an SD of 12, a day of 112 or 88 is unremarkable; a day of 150 is about four SDs out — genuinely unusual. For roughly bell-shaped data, about two-thirds of values fall within one SD of the mean and about 95% within two SDs (we'll make this precise in The Normal Distribution).
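The "how many SDs out" reading is just a subtraction and a division. The sketch below applies it to the sales numbers from the paragraph above, then checks the two-thirds and 95% rules of thumb on simulated bell-shaped data. The simulation settings (seed, sample size) are arbitrary choices for illustration.

```python
import numpy as np

mean_sales, sd_sales = 100, 12
for day in (112, 88, 150):
    z = (day - mean_sales) / sd_sales  # distance from the mean, in SD units
    print(f"sales={day}: {z:+.2f} SDs from the mean")

# Simulated bell-shaped data, assuming a normal distribution for illustration.
rng = np.random.default_rng(0)
sim = rng.normal(loc=mean_sales, scale=sd_sales, size=100_000)
dist = np.abs(sim - sim.mean())
within_1 = np.mean(dist <= sim.std(ddof=1))      # share within one SD
within_2 = np.mean(dist <= 2 * sim.std(ddof=1))  # share within two SDs
print(within_1, within_2)
```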

The n−1 divisor: why sample spread uses ddof=1

Here is the detail that confuses everyone, so let's make it intuitive rather than mathematical. When you compute spread from a sample (not the whole population), you divide the sum of squared deviations by n−1, not n.

Why? Because you measured the deviations from the sample mean, not the true population mean — and the sample mean is, by construction, the point that sits as close as possible to your sample's points. Your data hugs its own mean a little more tightly than it hugs the real population mean. So squared deviations from the sample mean are systematically a bit too small, and dividing by n would underestimate the true spread. Dividing by the smaller number n−1 nudges the estimate up to compensate. People say the sample mean "uses up one degree of freedom" — once you know the mean and n−1 of the values, the last value is determined, so only n−1 deviations are truly free to vary.
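A quick simulation can make the underestimate visible. This is a sketch under assumed settings: a normal population with a variance of 100, and 20,000 samples of size 5, all chosen for illustration. It averages the variance estimate across samples for each divisor.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 100.0  # population SD of 10, chosen for illustration

# 20,000 small samples of size 5 drawn from the same population.
samples = rng.normal(loc=0.0, scale=10.0, size=(20_000, 5))

avg_var_n = samples.var(axis=1, ddof=0).mean()          # divide by n
avg_var_n_minus_1 = samples.var(axis=1, ddof=1).mean()  # divide by n - 1

print(avg_var_n)          # lands near 80, about (n - 1)/n of the true variance
print(avg_var_n_minus_1)  # lands near 100, the true variance
```

With n = 5, dividing by n comes out roughly 20% too low on average, while dividing by n−1 centers on the true value.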

ddof = 'delta degrees of freedom'

In NumPy, ddof is what gets subtracted from n in the divisor. ddof=0 divides by n (the population formula). ddof=1 divides by n−1 (the sample formula). pandas defaults to ddof=1 (sample), while NumPy defaults to ddof=0 (population) — a classic source of "why don't my numbers match?" bugs.

Misconception: NumPy and pandas give the same std

They do not, by default. np.std(x) uses ddof=0; x.std() on a pandas Series uses ddof=1. For real-world data, which is almost always a sample of something larger, ddof=1 is the one you usually want. When you see a small discrepancy between two "standard deviations," check the divisor first.

For large n the difference between dividing by n and n−1 is tiny (1000 vs 999 barely moves the answer). It matters most for small samples — exactly when you're most at risk of underestimating spread.
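The default mismatch is easy to reproduce. In this sketch the five values are arbitrary; the point is that the same data gives two different "standard deviations" until the divisor is made explicit.

```python
import numpy as np
import pandas as pd

values = [10, 12, 9, 15, 14]
arr = np.array(values)
ser = pd.Series(values)

print(np.std(arr))          # NumPy default, ddof=0 (divide by n)
print(ser.std())            # pandas default, ddof=1 (divide by n - 1)
print(np.std(arr, ddof=1))  # now matches the pandas default
print(ser.std(ddof=0))      # now matches the NumPy default
```

Passing `ddof` explicitly in both libraries is one way to keep the choice visible to the next reader of the code.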

IQR: the robust spread

Just as the median is a robust center, the interquartile range (IQR) is a robust spread. It's the range of the middle 50% of the data: the 75th percentile minus the 25th percentile (Q3 − Q1). Because it ignores the top and bottom 25%, a single outlier, however extreme, can barely move it. That makes it the spread of choice for skewed, outlier-prone data, and it's the engine behind box plots and the 1.5×IQR outlier rule we'll meet in Shape and Outliers.
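Here is a minimal sketch of the IQR computed two ways, by hand with `np.percentile` and with `scipy.stats.iqr`, on toy data where the largest value is then replaced by an extreme one. The helper name `iqr_manual` and the numbers are illustrative assumptions.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)
x_outlier = x.copy()
x_outlier[-1] = 900  # make the largest value extreme

def iqr_manual(a):
    """Q3 minus Q1, using NumPy's default percentile interpolation."""
    q1, q3 = np.percentile(a, [25, 75])
    return q3 - q1

print(iqr_manual(x), stats.iqr(x))                  # both 4.0
print(iqr_manual(x_outlier), stats.iqr(x_outlier))  # still 4.0
print(x.std(ddof=1), x_outlier.std(ddof=1))         # the SD jumps sharply
```

The quartiles sit inside the middle of the sorted data, so stretching the top value changes the standard deviation but not the IQR.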

Std vs IQR: which spread to report

Mirror your center choice. If the data is roughly symmetric and you're reporting the mean, report the standard deviation alongside it. If the data is skewed and you're reporting the median, report the IQR. Mixing a robust center with a non-robust spread (or vice versa) sends mixed signals.

MAD: another robust spread

The median absolute deviation (MAD) is a close cousin of the IQR. Take each point's absolute distance from the median, then take the median of those distances. Like the IQR, it shrugs off outliers because medians do. You'll see it in outlier detection and robust statistics; the idea is just "the typical distance from the middle, measured robustly."
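The two-step recipe (distances from the median, then the median of those) can be sketched directly and compared with `scipy.stats.median_abs_deviation`, which by default returns the unscaled MAD. The data is a toy example with one deliberately extreme value.

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 900], dtype=float)

med = np.median(x)               # 5.0, unaffected by the 900
abs_dev = np.abs(x - med)        # each point's distance from the median
mad_manual = np.median(abs_dev)  # 2.0

mad_scipy = stats.median_abs_deviation(x)  # default scale=1.0, the raw MAD

print(mad_manual, mad_scipy)
print(x.std(ddof=1))  # for contrast: dominated by the single extreme value
```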

CV: comparing spread across different scales

Here's a trap. Is a standard deviation of 5 "big"? You can't say — it depends entirely on the scale. An SD of 5 on human heights (cm) is modest; an SD of 5 on a 1-to-10 satisfaction score is enormous. Raw standard deviations are not comparable across variables measured in different units or at different magnitudes.

The coefficient of variation (CV) fixes this by dividing the standard deviation by the mean: CV = std / mean. It's a unitless, relative measure of spread — "how big is the spread compared to the typical value?" — which lets you compare the variability of things on totally different scales.

Misconception: a bigger standard deviation means more spread, period

Only within the same variable. Comparing the SD of revenue (millions) to the SD of conversion rate (a fraction) tells you nothing — the revenue SD is bigger purely because the numbers are bigger. To compare variability across different scales, use the CV. (Caveat: the CV needs a meaningful, positive mean; it's not appropriate when the mean is near zero or the data can go negative, like temperatures in Celsius.)
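As a sketch of the comparison, the snippet below computes the CV for two invented variables on very different scales (heights in cm and a 1-to-10 score). The helper `cv` is a hypothetical convenience function; `scipy.stats.variation` with `ddof=1` is used as a cross-check.

```python
import numpy as np
from scipy import stats

heights_cm = np.array([160, 165, 170, 175, 180], dtype=float)
scores = np.array([2, 4, 6, 8, 10], dtype=float)

def cv(a):
    """Coefficient of variation: sample SD divided by the mean."""
    return a.std(ddof=1) / a.mean()

print(heights_cm.std(ddof=1), scores.std(ddof=1))  # heights have the larger SD
print(cv(heights_cm), cv(scores))                  # scores have the larger CV
print(stats.variation(scores, ddof=1))             # should match cv(scores)
```

The raw SDs rank the two variables one way and the CVs rank them the other, which is the trap the misconception above describes.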

Question (select one)

Stock A has a mean daily price of $500 with a standard deviation of $20. Stock B has a mean price of $10 with a standard deviation of $3. Which stock is relatively more volatile, and what should you compute to decide?

Stock A, because its standard deviation ($20) is larger

Stock B, because its coefficient of variation (3/10 = 0.30) is far higher than A's (20/500 = 0.04)

They are equally volatile because volatility is absolute

You cannot compare them without the raw price histories

Putting it together: a spread summary

In practice you compute several spread measures together and read them against each other, just like you did with centers. A big gap between the standard deviation and the IQR is itself a clue that outliers or skew are present.
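One way to read the two measures against each other is the ratio of SD to IQR. The sketch below uses simulated data (normal values with a mean of 50 and an SD of 10, an assumption made for illustration) and then plants five extreme values to show how the ratio reacts. The helper name `sd_to_iqr` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(loc=50, scale=10, size=1_000)
dirty = clean.copy()
dirty[:5] = 1_000  # plant five extreme values

def sd_to_iqr(a):
    """Sample SD divided by the IQR."""
    q1, q3 = np.percentile(a, [25, 75])
    return a.std(ddof=1) / (q3 - q1)

ratio_clean = sd_to_iqr(clean)  # roughly 0.74 for bell-shaped data
ratio_dirty = sd_to_iqr(dirty)  # much larger once outliers are present

print(ratio_clean, ratio_dirty)
```

For normally distributed data the IQR is about 1.35 SDs, so a ratio near 0.74 is unremarkable; a ratio well above 1 suggests the SD is being inflated by a few extreme points.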

You're given two pandas Series, group_a and group_b, of measurements on the same scale.

Part 1 — spread of group A. Build a dict spread_a with these keys (all values plain Python float):

"std" — the sample standard deviation (ddof=1)
"iqr" — the interquartile range, using scipy.stats.iqr
"cv" — the coefficient of variation, using scipy.stats.variation with ddof=1

Part 2 — compare. Set a string more_variable to "a" if group_a has the larger sample standard deviation, otherwise "b".

Use the provided Series. Make sure every dict value is a float (not a NumPy scalar).

Check your understanding

Question (select one)

You compute a column's spread two ways and get np.std(x) = 11.8 and x.std() (on the same data as a pandas Series) = 12.0. What explains the difference?

One of the two libraries has a bug

NumPy's default uses ddof=0 (divide by n) while pandas defaults to ddof=1 (divide by n−1), so the pandas value is slightly larger

The data must contain missing values that pandas dropped

pandas rounds and NumPy does not

Question (select one)

Why do we divide by n−1 instead of n when estimating spread from a sample?

To make the standard deviation smaller and more conservative

Because the population is always smaller than the sample

Deviations are measured from the sample mean, which hugs the sample too tightly, so dividing by n would underestimate the true spread; n−1 corrects for that

It is an arbitrary convention with no real justification

Question (select one)

A dataset of file sizes is strongly right-skewed with a few enormous files. You want a spread measure that the giant files won't dominate. Which is the best choice?

The range (max − min)

The standard deviation

The interquartile range (IQR)

The variance

Question (select one)

The standard deviation is preferred over the variance for reporting spread mainly because:

It is always smaller than the variance

It is more robust to outliers than the variance

It is expressed in the same units as the data, whereas the variance is in squared units that are hard to interpret

The variance can be negative but the standard deviation cannot

Question (select one)

When is the coefficient of variation the appropriate spread measure?

Whenever you want the most robust possible measure of spread

For any data, since it always improves on the standard deviation

When comparing the relative variability of variables measured on different scales or in different units, and the mean is positive and meaningful

Only when the data is perfectly symmetric

Key takeaways

Spread matters as much as center — risk, reliability, and surprise all live in how much the data moves, not where it sits.
Range = max − min: quick but fragile and outlier-driven.
Variance = average squared deviation (squared units); standard deviation = its square root, in the data's own units and readable as a typical distance from the mean.
Use ddof=1 for sample spread (pandas' default); NumPy defaults to ddof=0 — a common mismatch.
IQR and MAD are robust spreads (middle 50% / median distance): pair them with the median on skewed data.
CV = std / mean is unitless — the right tool for comparing variability across different scales (when the mean is positive).
