Measures of Spread
Range, variance, standard deviation, IQR, MAD, and the coefficient of variation — why how spread out the data is matters as much as where its center sits.
Two delivery services both average a 30-minute wait. One always arrives between 28 and 32 minutes. The other swings from 5 minutes to an hour. Same center, wildly different experience — and if you only reported the average, you'd call them identical. A measure of center tells you where the data sits; a measure of spread tells you how much it moves. You almost never understand a column from its center alone.
Spread is where risk, reliability, and surprise live. A model's error,
a process's consistency, a portfolio's volatility, a sensor's noise —
all of these are spread, not center. This page builds the vocabulary:
range, variance, standard deviation, the IQR, the
MAD, and the coefficient of variation, plus the one divisor
detail (n−1) that trips up almost everyone.
Same center, different spread
Let's make the delivery story concrete. Both services have a mean of 30, but everything that matters is in the spread.
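A minimal sketch of that comparison follows. The wait times are made-up numbers, chosen only so that both services average 30 minutes, and the names service_a and service_b are hypothetical.

```python
import numpy as np

# Hypothetical wait times in minutes, invented for illustration.
service_a = np.array([28, 29, 30, 30, 31, 32])
service_b = np.array([5, 12, 25, 33, 45, 60])

for name, waits in [("A", service_a), ("B", service_b)]:
    mean = waits.mean()
    spread = waits.std(ddof=1)   # sample standard deviation
    print(f"Service {name}: mean={mean:.1f}  std={spread:.1f}  "
          f"min={waits.min()}  max={waits.max()}")
```

The means come out identical, while the standard deviation and the min/max of the second service are many times larger.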
Center and spread are a package deal
Reporting a mean without a spread is like giving a GPS location with no accuracy radius. "Around 30 minutes, give or take 1" and "around 30 minutes, give or take 25" are completely different claims. Always pair a center with a spread.
Range: the quick-and-dirty spread
The range is just max − min. It's the easiest spread to compute
and explain, and it's genuinely useful as a first glance or a
sanity check (a range of 0 means a constant column; a range of 10,000
on what should be ages means a data-entry bug).
But the range has a fatal flaw for serious use: it depends on only the two most extreme points, so a single outlier or typo can blow it up. It also tends to grow as you collect more data — more observations mean more chances to see an extreme — so it's not comparable across different sample sizes.
The range is fragile
Because it uses only two values, the range throws away everything in between and is maximally sensitive to outliers. Use it for a quick gut-check, never as your primary spread measure.
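Both weaknesses can be seen in a short sketch. The ages and the typo are invented for illustration, and the second half uses simulated normal data, which is an assumption rather than anything taken from a real dataset.

```python
import numpy as np

ages = np.array([23, 31, 35, 42, 47, 52, 58])
ages_with_typo = np.append(ages, 420)   # hypothetical typo: 42 keyed in as 420

clean_range = ages.max() - ages.min()
typo_range = ages_with_typo.max() - ages_with_typo.min()
print(clean_range, typo_range)          # one bad value sets the whole range

# The range also tends to widen as the sample grows, even without outliers.
rng = np.random.default_rng(0)
ranges_by_n = {}
for n in (10, 100, 10_000):
    sample = rng.normal(loc=30, scale=5, size=n)   # simulated wait times
    ranges_by_n[n] = sample.max() - sample.min()
    print(n, round(ranges_by_n[n], 1))
```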
Variance and standard deviation: spread around the mean
Most spread measures ask the same question: on average, how far are the points from the center? The natural idea is to take each point's distance from the mean and average those distances. But the raw deviations sum to zero (positives and negatives cancel — that's what "balance point" means). To stop the cancellation, we square each deviation before averaging.
That average of squared deviations is the variance. It's a real, honest spread number, but it's in squared units (square dollars, square minutes), which are hard to interpret. So we take the square root and get the standard deviation, which is back in the original units and reads as a typical distance from the mean.
s = √( Σ(xᵢ − x̄)² / (n − 1) )
You will rarely compute this by hand — numpy and pandas do it — but
the shape matters: square the distances, average them, square-root back.
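To make that shape concrete, here is a sketch that follows the formula step by step and then checks it against NumPy. The array x is an arbitrary example, not data from this page.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

deviations = x - x.mean()                 # distances from the mean
print(deviations.sum())                   # cancels to 0 (up to rounding)

squared = deviations ** 2                 # squaring stops the cancellation
variance = squared.sum() / (len(x) - 1)   # sample variance, in squared units
std = np.sqrt(variance)                   # back to the original units

print(variance, np.var(x, ddof=1))        # the two should agree
print(std, np.std(x, ddof=1))
```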
How to read a standard deviation
The standard deviation is in the same units as your data and means roughly "the typical amount a value differs from the mean." If daily sales average 100 with an SD of 12, a day of 112 or 88 is unremarkable; a day of 150 is about four SDs out — genuinely unusual. For roughly bell-shaped data, about two-thirds of values fall within one SD of the mean and about 95% within two SDs (we'll make this precise in The Normal Distribution).
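The "how many SDs out" reading is just a division, sketched below with the daily-sales figures from the paragraph above. The second half checks the two-thirds and 95% rules of thumb on simulated bell-shaped data; the simulation is an assumption, since real sales need not be normal.

```python
import numpy as np

mean, sd = 100, 12                     # the daily-sales figures from the text
for day in (88, 112, 150):
    z = (day - mean) / sd              # distance from the mean, in SDs
    print(day, round(z, 2))

# Simulated bell-shaped data (an assumption; real sales may not be normal).
rng = np.random.default_rng(42)
sales = rng.normal(loc=mean, scale=sd, size=100_000)
distance = np.abs(sales - sales.mean())
within_1 = np.mean(distance <= sales.std(ddof=1))
within_2 = np.mean(distance <= 2 * sales.std(ddof=1))
print(round(within_1, 3), round(within_2, 3))   # close to 0.68 and 0.95
```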
The n−1 divisor: why sample spread uses ddof=1
Here is the detail that confuses everyone, so let's make it intuitive
rather than mathematical. When you compute spread from a sample
(not the whole population), you divide the sum of squared deviations by
n−1, not n.
Why? Because you measured the deviations from the sample mean, not
the true population mean — and the sample mean is, by construction, the
point that sits as close as possible to your sample's points. Your data
hugs its own mean a little more tightly than it hugs the real
population mean. So squared deviations from the sample mean are
systematically a bit too small, and dividing by n would
underestimate the true spread. Dividing by the smaller number
n−1 nudges the estimate up to compensate. People say the sample mean
"uses up one degree of freedom" — once you know the mean and n−1 of
the values, the last value is determined, so only n−1 deviations are
truly free to vary.
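A small simulation can make the underestimate visible. This is a sketch under stated assumptions: the population is normal with a known SD of 2 (so the true variance is 4), and we draw many samples of size 5. None of these numbers come from the page.

```python
import numpy as np

rng = np.random.default_rng(0)
true_sd = 2.0                 # population SD, so the true variance is 4.0
n, trials = 5, 20_000         # many small samples

samples = rng.normal(loc=0.0, scale=true_sd, size=(trials, n))

var_div_n = samples.var(axis=1, ddof=0).mean()     # divide by n
var_div_n1 = samples.var(axis=1, ddof=1).mean()    # divide by n - 1

print(f"true variance:          {true_sd ** 2:.2f}")
print(f"average, divide by n:   {var_div_n:.2f}")   # near 4 * (n - 1) / n = 3.2
print(f"average, divide by n-1: {var_div_n1:.2f}")  # near 4.0
```

The comparison is done on variances, where the n−1 correction removes the bias on average. Taking the square root to get a standard deviation still leaves a very small downward bias, which is usually ignored.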
ddof = 'delta degrees of freedom'
In NumPy, ddof is what gets subtracted from n in the divisor.
ddof=0 divides by n (the population formula). ddof=1 divides
by n−1 (the sample formula). pandas defaults to ddof=1
(sample), while NumPy defaults to ddof=0 (population) — a classic
source of "why don't my numbers match?" bugs.
Misconception: NumPy and pandas give the same std
They do not, by default. np.std(x) uses ddof=0; x.std() on a
pandas Series uses ddof=1. For real-world data, which is almost
always a sample of something larger, ddof=1 is the one you
usually want. When you see a small discrepancy between two "standard
deviations," check the divisor first.
For large n the difference between dividing by n and n−1 is tiny
(1000 vs 999 barely moves the answer). It matters most for small
samples — exactly when you're most at risk of underestimating spread.
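The mismatch is easy to reproduce. The sketch below uses five made-up values and shows that passing ddof explicitly makes the two libraries agree.

```python
import numpy as np
import pandas as pd

values = [12.0, 15.0, 9.0, 20.0, 14.0]   # made-up numbers
arr = np.array(values)
ser = pd.Series(values)

print(np.std(arr))            # NumPy default: ddof=0, divide by n
print(ser.std())              # pandas default: ddof=1, divide by n - 1

# Being explicit removes the mismatch.
print(np.std(arr, ddof=1))    # matches ser.std()
print(ser.std(ddof=0))        # matches np.std(arr)
```

Passing ddof every time, even when it matches the default, is a cheap habit that keeps results consistent when code moves between the two libraries.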
IQR: the robust spread
Just as the median is a robust center, the interquartile range
(IQR) is a robust spread. It's the range of the middle 50% of the
data: the 75th percentile minus the 25th percentile (Q3 − Q1).
Because it ignores the top and bottom 25%, a single outlier can shift it
only slightly, however extreme that outlier is. That makes it the spread of choice for skewed,
outlier-prone data, and it's the engine behind box plots and the
1.5×IQR outlier rule we'll meet in Shape and Outliers.
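Here is a sketch of that robustness, using a toy array and a hypothetical helper named iqr. It relies on NumPy's default (linear) percentile interpolation; other quartile conventions can give slightly different values.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
with_outlier = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9000])  # 9 replaced by 9000

def iqr(x):
    """Q3 - Q1 using NumPy's default (linear) percentile interpolation."""
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

print(iqr(data), iqr(with_outlier))                        # IQR is unchanged
print(np.std(data, ddof=1), np.std(with_outlier, ddof=1))  # std explodes
```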
Std vs IQR: which spread to report
Mirror your center choice. If the data is roughly symmetric and you're reporting the mean, report the standard deviation alongside it. If the data is skewed and you're reporting the median, report the IQR. Mixing a robust center with a non-robust spread (or vice versa) sends mixed signals.
MAD: another robust spread
The median absolute deviation (MAD) is a close cousin of the IQR. Take each point's absolute distance from the median, then take the median of those distances. Like the IQR, it shrugs off outliers because medians do. You'll see it in outlier detection and robust statistics; the idea is just "the typical distance from the middle, measured robustly."
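The two-step recipe translates almost word for word into NumPy. The helper mad below is a hypothetical name, and it returns the raw, unscaled value; some implementations multiply by a constant (about 1.4826) so the result is comparable to a standard deviation on normal data.

```python
import numpy as np

def mad(x):
    """Median absolute deviation: median of |x - median(x)| (unscaled)."""
    x = np.asarray(x, dtype=float)
    return np.median(np.abs(x - np.median(x)))

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9000]

print(mad(data), mad(with_outlier))   # the outlier leaves the MAD unchanged
```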
CV: comparing spread across different scales
Here's a trap. Is a standard deviation of 3 "big"? You can't say: it depends entirely on the scale. An SD of 3 on human heights (cm) is small; an SD of 3 on a 1-to-10 satisfaction score is enormous. Raw standard deviations are not comparable across variables measured in different units or at different magnitudes.
The coefficient of variation (CV) fixes this by dividing the
standard deviation by the mean: CV = std / mean. It's a unitless,
relative measure of spread — "how big is the spread compared to the
typical value?" — which lets you compare the variability of things on
totally different scales.
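A sketch of the comparison, with made-up heights and made-up 1-to-10 ratings; the helper cv is a hypothetical name and uses the sample standard deviation.

```python
import numpy as np

def cv(x):
    """Coefficient of variation: sample std divided by the mean."""
    x = np.asarray(x, dtype=float)
    return x.std(ddof=1) / x.mean()

heights_cm = [160, 165, 170, 175, 180]   # made-up heights
scores = [2, 4, 5, 7, 9]                 # made-up 1-to-10 ratings

for name, x in [("heights", heights_cm), ("scores", scores)]:
    print(f"{name}: std={np.std(x, ddof=1):.2f}  cv={cv(x):.3f}")
```

The heights have the larger standard deviation, but relative to their mean they vary far less than the scores do, which is what the CV reports.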
Misconception: a bigger standard deviation means more spread, period
Only within the same variable. Comparing the SD of revenue (millions) to the SD of conversion rate (a fraction) tells you nothing — the revenue SD is bigger purely because the numbers are bigger. To compare variability across different scales, use the CV. (Caveat: the CV needs a meaningful, positive mean; it's not appropriate when the mean is near zero or the data can go negative, like temperatures in Celsius.)
Stock A has a mean daily price of $500 with a standard deviation of $20. Stock B has a mean price of $10 with a standard deviation of $3. Which stock is relatively more volatile, and what should you compute to decide?
Stock A, because its standard deviation ($20) is larger
Stock B, because its coefficient of variation (3/10 = 0.30) is far higher than A's (20/500 = 0.04)
They are equally volatile because volatility is absolute
You cannot compare them without the raw price histories
Putting it together: a spread summary
In practice you compute several spread measures together and read them against each other, just like you did with centers. A big gap between the standard deviation and the IQR is itself a clue that outliers or skew are present.
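One possible shape for such a summary is sketched below. The function name spread_summary and the file sizes are illustrative assumptions, and the quartiles use NumPy's default interpolation.

```python
import numpy as np

def spread_summary(x):
    """Several spread measures side by side (names are illustrative)."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    med = np.median(x)
    return {
        "range": float(x.max() - x.min()),
        "std": float(x.std(ddof=1)),
        "iqr": float(q3 - q1),
        "mad": float(np.median(np.abs(x - med))),
        "cv": float(x.std(ddof=1) / x.mean()),
    }

# Made-up file sizes in MB: mostly small, with one enormous file.
sizes = [1, 2, 2, 3, 3, 4, 5, 6, 8, 500]
summary = spread_summary(sizes)
for key, value in summary.items():
    print(f"{key:>5}: {value:.2f}")
```

On this data the standard deviation is dozens of times larger than the IQR, which is exactly the kind of gap that signals an outlier or heavy skew.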
You're given two pandas Series, group_a and group_b, of measurements on the same scale.
Part 1: spread of group A. Build a dict spread_a with these keys (all values plain Python float):
- "std": the sample standard deviation (ddof=1)
- "iqr": the interquartile range, using scipy.stats.iqr
- "cv": the coefficient of variation, using scipy.stats.variation with ddof=1
Part 2: compare. Set a string more_variable to "a" if group_a has the larger sample standard deviation, otherwise "b".
Use the provided Series. Make sure every dict value is a float (not a NumPy scalar).
Check your understanding
You compute a column's spread two ways and get np.std(x) = 11.8 and x.std() (on the same data as a pandas Series) = 12.0. What explains the difference?
One of the two libraries has a bug
NumPy's default uses ddof=0 (divide by n) while pandas defaults to ddof=1 (divide by n−1), so the pandas value is slightly larger
The data must contain missing values that pandas dropped
pandas rounds and NumPy does not
Why do we divide by n−1 instead of n when estimating spread from a sample?
To make the standard deviation smaller and more conservative
Because the population is always smaller than the sample
Deviations are measured from the sample mean, which hugs the sample too tightly, so dividing by n would underestimate the true spread; n−1 corrects for that
It is an arbitrary convention with no real justification
A dataset of file sizes is strongly right-skewed with a few enormous files. You want a spread measure that the giant files won't dominate. Which is the best choice?
The range (max − min)
The standard deviation
The interquartile range (IQR)
The variance
The standard deviation is preferred over the variance for reporting spread mainly because:
It is always smaller than the variance
It is more robust to outliers than the variance
It is expressed in the same units as the data, whereas the variance is in squared units that are hard to interpret
The variance can be negative but the standard deviation cannot
When is the coefficient of variation the appropriate spread measure?
Whenever you want the most robust possible measure of spread
For any data, since it always improves on the standard deviation
When comparing the relative variability of variables measured on different scales or in different units, and the mean is positive and meaningful
Only when the data is perfectly symmetric
Key takeaways
- Spread matters as much as center — risk, reliability, and surprise all live in how much the data moves, not where it sits.
- Range = max − min: quick but fragile and outlier-driven.
- Variance = average squared deviation (squared units); standard deviation = its square root, in the data's own units and readable as a typical distance from the mean.
- Use ddof=1 for sample spread (pandas' default); NumPy defaults to ddof=0, a common mismatch.
- IQR and MAD are robust spreads (middle 50% / median distance): pair them with the median on skewed data.
- CV = std / mean is unitless — the right tool for comparing variability across different scales (when the mean is positive).