Measures of Center
Mean, median, and mode — what each one captures, when each is the honest summary, and why a single "average" can mislead you on skewed data.
You have a column of numbers — salaries, response times, order values — and someone asks the most innocent question in data science: "what's the typical value?" That word "typical" is doing a lot of work. It hides a choice between three different summaries — the mean, the median, and the mode — that can disagree wildly. Picking the wrong one is how a perfectly honest analyst ends up reporting a number that no real person in the data would recognize.
A measure of center collapses a whole distribution down to one number that's supposed to stand in for "the middle." The trouble is that "the middle" isn't one idea. This page is about knowing which center you're actually asking for, and when each one quietly lies.
The three centers, and what each one means
- Mean (the arithmetic average): add everything up, divide by the count. It's the balance point of the data — the spot where the values on either side exactly counterweight each other.
- Median: sort the values and take the middle one (or the average of the two middle ones). Half the data sits below it, half above. It answers "what's the value of the typical member?"
- Mode: the most frequently occurring value. It answers "what's the most common outcome?" — the only one of the three that makes sense for categories like favorite color or payment method.
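As a minimal sketch of the three definitions above, the snippet below uses Python's standard-library `statistics` module. The list of daily ticket counts is made up for illustration and was chosen to be roughly symmetric with no extreme values.

```python
import statistics

# Hypothetical daily ticket counts: roughly symmetric, no extreme values.
tickets = [18, 20, 21, 22, 22, 22, 23, 24, 26]

mean_tickets = statistics.mean(tickets)      # sum divided by count
median_tickets = statistics.median(tickets)  # middle value after sorting
mode_tickets = statistics.mode(tickets)      # most frequent value

print(mean_tickets, median_tickets, mode_tickets)
```

On this particular list all three centers come out to 22.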
When data is roughly symmetric and has no extreme values, as with a stable daily ticket count, all three land in the same neighborhood, and the choice barely matters. The interesting cases are when they diverge. That divergence is not a nuisance; it's information about the shape of your data.
They agree when the data is symmetric
For a perfectly symmetric, single-peaked distribution, the mean, median, and mode all coincide. The further apart they drift, the more skewed your data is — so the gap between mean and median is itself a quick read on shape (which we explore in Shape and Outliers).
Where the mean lies: skew and outliers
The mean's defining property — it's the balance point — is also its weakness. A balance point is dragged toward heavy values, no matter how few of them there are. One billionaire in a room of teachers pulls the average net worth into the millions, even though that number describes nobody present. This is not a rare edge case; it's the default for money, sizes, durations, and counts, which are almost all right-skewed (a long tail stretching toward large values).
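To make the pull of a single heavy value concrete, here is a small sketch with invented salaries for a small company. The figures are assumptions chosen for illustration, not real data.

```python
import statistics

# Hypothetical annual salaries (USD) at a small company.
salaries = [42_000, 45_000, 48_000, 50_000, 52_000,
            55_000, 58_000, 60_000, 65_000]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)

# Add one highly paid executive and recompute both centers.
with_exec = salaries + [500_000]
mean_after = statistics.mean(with_exec)
median_after = statistics.median(with_exec)

print(f"mean:   {mean_before:,.0f} -> {mean_after:,.0f}")
print(f"median: {median_before:,.0f} -> {median_after:,.0f}")
```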
Adding a single person moved the mean by tens of thousands but nudged the median by almost nothing. That's the whole story in one example: the median is robust — resistant to a handful of extreme values — while the mean is sensitive to them. Neither is "right" in the abstract; they answer different questions. But if you report the mean salary of that company, you'll quote a number that's higher than what almost everyone earns.
Misconception: the mean is always 'the typical value'
On skewed data the mean is not typical — it's pulled toward the long tail and can sit above most of your observations. "Average household income" is famously misleading for exactly this reason: a few very high earners lift the mean well above what a middle household actually makes. On right-skewed data, the median is usually the honest "typical."
Misconception: 'average' always means the mean
In everyday speech "the average" defaults to the mean, but statistically average just means "a measure of center" — the median and mode are averages too. When a report says "the average user," always ask which center they computed and whether the data is skewed.
A quick MCQ before we go further
A dataset of home prices in a neighborhood is strongly right-skewed: most homes are modest, but a few mansions sell for 10x the rest. A realtor wants to advertise the "typical" price. Which measure is the most honest summary?
- The mean, because it uses every value in the data
- The median, because half the homes are above it and half below, unaffected by the extreme high sales
- The mode, because it's the single most common price
- The mean, because medians throw away information
When each measure is the right tool
The decision is mostly about data type and shape:
- Mode is the only center that works for categorical data (you can't average "red" and "blue"), and it's the right call for asking "what's the most common outcome?" It also flags multimodal data — two peaks usually mean two groups mixed together, and no single center describes them well.
- Mean is the natural choice for symmetric numeric data with no extreme outliers. It uses every value, has convenient mathematical properties, and underlies most of the inference later in this course (the mean of a sample is what the central limit theorem is about).
- Median is the right call whenever data is skewed or outlier-prone: incomes, house prices, response times, file sizes, wait times. It answers "what does the middle case look like?"
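The mode's two roles in the list above (a center for categories, and a flag for mixed groups) can be sketched with the standard library. Both columns below are hypothetical. Note that `statistics.multimode` only reports exact ties for the most frequent value, so on real continuous data a histogram is usually a better way to spot multiple peaks.

```python
import statistics
from collections import Counter

# Hypothetical categorical column: payment method per order.
payments = ["credit", "cash", "credit", "mobile", "debit", "credit", "debit"]
most_common_method = statistics.mode(payments)  # works on non-numeric data
print(most_common_method, Counter(payments).most_common(2))

# Hypothetical numeric column with two peaks (two groups mixed together).
wait_minutes = [2, 3, 3, 3, 4, 11, 12, 12, 12, 13]
peaks = statistics.multimode(wait_minutes)  # every value tied for most frequent
mean_wait = statistics.mean(wait_minutes)
print(peaks, mean_wait)
```

In the second column the mean falls between the two clusters, at a value close to none of the observations, which is why no single center describes such data well.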
A practical habit: report both
When you're unsure, compute the mean and the median and look at the gap. If they're close, report the mean. If they diverge, that gap is telling you the data is skewed — report the median as the headline and mention the mean only with context. Showing both is often the most honest move.
Three useful variations (keep these light)
Most of the time, mean/median/mode are all you need. But three relatives show up often enough to recognize:
Trimmed mean. Chop off a percentage from each end (say the top and bottom 10%), then take the mean of what's left. It's a compromise: more robust than the mean, but still uses most of the data. The same idea appears in judged sports that drop the highest and lowest scores before averaging, and in sensor pipelines that discard extreme readings.
Weighted mean. When some observations count more than others — larger stores, more-reliable sensors, bigger survey strata — give each a weight. A company-wide average satisfaction score should weight each team by headcount, not treat a 3-person team the same as a 300-person one.
Geometric mean. For rates, ratios, and growth factors (returns, percent changes, fold-changes), the arithmetic mean overstates typical growth. The geometric mean multiplies the values and takes the nth root, which is the correct "average factor" for things that compound.
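The three variations can be sketched in a few lines of standard-library Python. Everything here is illustrative: `trimmed_mean` is a hypothetical helper written for this example, and the response times, team scores, and growth factors are invented.

```python
import math
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut per side
    return statistics.mean(ordered[k:len(ordered) - k])

# Trimmed mean: hypothetical response times (ms) with one extreme value.
response_ms = [110, 120, 125, 130, 135, 140, 145, 150, 160, 2000]
trimmed = trimmed_mean(response_ms, 0.10)
print(statistics.mean(response_ms), trimmed)

# Weighted mean: hypothetical team satisfaction scores weighted by headcount.
scores = [9.0, 6.0]
headcounts = [3, 300]
weighted = sum(s * w for s, w in zip(scores, headcounts)) / sum(headcounts)
print(statistics.mean(scores), weighted)

# Geometric mean: four hypothetical yearly growth factors.
factors = [1.5, 0.5, 1.2, 1.1]
geo = statistics.geometric_mean(factors)
arith = statistics.mean(factors)
print(geo, arith)
print(geo ** 4, math.prod(factors))  # geometric mean compounded vs actual total growth
print(arith ** 4)                    # arithmetic mean compounded over the same 4 years
```

The trimmed mean ignores the single 2000 ms reading, and the weighted mean sits close to the 300-person team's score instead of halfway between the two teams.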
Notice the last two lines: raising the geometric mean to the 4th power exactly reproduces the total compounded growth, while the arithmetic mean would not. That's the tell that you're in geometric-mean territory — whenever the quantities multiply rather than add.
When to reach for the geometric mean
Use it for anything expressed as a rate or multiplier that compounds: investment returns, population growth, "our traffic grew 3x then 0.5x then 2x." Averaging those with the arithmetic mean overstates typical growth. For plain additive quantities (heights, temperatures, dollars), stick with the arithmetic mean or median.
Putting it together: a robust center summary
In real EDA you rarely report a single number — you report a small summary and let the gaps between numbers tell you about shape. Here's the pattern you'll reuse constantly: compute several centers at once and read them together.
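One way to sketch that pattern with only the standard library is shown below. The function name `center_summary`, its `trim` parameter, and the order values are assumptions made for this example. The trimming rule (cut `int(n * trim)` values per side) is one common convention, not the only one.

```python
import statistics

def center_summary(values, trim=0.10):
    """Return several centers for one numeric column, plus the mean-median gap."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)            # values dropped from each end
    trimmed = ordered[k:len(ordered) - k]
    mean = statistics.mean(ordered)
    median = statistics.median(ordered)
    return {
        "mean": float(mean),
        "median": float(median),
        "trimmed_mean": float(statistics.mean(trimmed)),
        "mean_minus_median": float(mean - median),
    }

# Hypothetical right-skewed order values (USD).
order_values = [12, 15, 18, 20, 22, 25, 30, 35, 60, 400]
summary = center_summary(order_values)
for name, value in summary.items():
    print(f"{name:>18}: {value:8.2f}")
```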
The positive mean_minus_median is your skew alarm. When it's large and positive, lead with the median.
You're given a pandas Series prices of right-skewed product prices. Build a summary you can trust on skewed data.
Compute a dictionary called result with exactly these keys (all values plain Python float):
- "mean": the arithmetic mean
- "median": the median
- "trimmed_mean": the 10% trimmed mean (drop 10% from each end) using scipy.stats.trim_mean
- "skew_gap": mean minus median
Then set a boolean report_median to True if skew_gap is greater than 0 (i.e. right-skewed, so the median is the more honest headline), else False.
Use the provided prices Series.
Check your understanding
You compute the mean monthly spend of your users as $84 and the median as $41. What is the most defensible reading of this gap?
- The data is left-skewed, so a few very low spenders pull the mean down
- The data is right-skewed; a minority of high spenders inflate the mean, so the median ($41) better reflects a typical user
- The mean must be wrong because it should be close to the median
- Half of all users spend exactly $84
A survey asks for respondents' favorite payment method (cash, credit, debit, mobile). Which measure of center even makes sense for this column?
- The mean of the four options
- The median payment method
- The mode: the most frequently chosen payment method
- None of them; categorical data has no center
An investment returns +50% one year and −50% the next. Someone reports the "average annual return" as 0% using the arithmetic mean. Why is the geometric mean the better tool here?
- The arithmetic mean of percentages is always undefined
- Returns compound multiplicatively, and $100 → $150 → $75 is a real loss; the geometric mean of the growth factors (1.5 and 0.5) captures that, while the arithmetic mean hides it
- The geometric mean is just a more precise version of the arithmetic mean
- Because percentages can be negative
Which statement about the trimmed mean is accurate?
- It is identical to the median
- It is more sensitive to outliers than the ordinary mean
- It discards a fixed percentage of the smallest and largest values, then averages what remains, making it a middle ground between mean and median in robustness
- It can only be used on symmetric data
Key takeaways
- Mean = balance point; uses all data; great for symmetric data but dragged around by skew and outliers.
- Median = middle value; robust; the honest "typical" for skewed data like income, prices, and durations.
- Mode = most common value; the only center for categorical data and a flag for multimodality.
- The gap between mean and median is a free read on skew — when they diverge, lead with the median.
- Reach for the geometric mean for compounding rates and the trimmed/weighted mean when you need robustness or unequal weights.