Aggregation Basics
Sum, mean, median, min, max — the simple reductions that turn many rows into one number, and the subtle choices that change what they mean.
An aggregation turns many values into one. The mean of a column, the count of rows, the maximum price, the sum of revenue — these are all aggregations.
This chapter covers single-Series aggregations. The next chapter
introduces groupby, which is aggregations per group.
The reduction methods
Every numeric Series supports these:
| Method | Returns the… |
|---|---|
.sum() | Sum |
.mean() | Arithmetic mean |
.median() | Middle value |
.min() | Minimum |
.max() | Maximum |
.std() | Standard deviation |
.var() | Variance |
.quantile(q) | Quantile (e.g. 0.25) |
.count() | Number of non-null values |
.nunique() | Number of distinct values |
.first() | First value |
.last() | Last value |
How aggregations handle missing values
By default, NaN values are skipped. That sounds harmless but matters more than you think.
Compare with what you might naively expect: dividing by 5 (the length) would give 14.0. Pandas divides by 3 (the count of non-null values) and gives 23.33. This is almost always what you want — but be aware.
Mean vs median
The two most-cited "averages" mean different things.
- Mean — sum / count. Sensitive to outliers.
- Median — middle value when sorted. Robust to outliers.
A single outlier moved the mean from ~63 (where the typical person sits) to over 100. The median is unaffected. When reporting "typical" values to a non-technical audience, the median is often a better choice — and you should usually report both.
A useful habit
For any "typical X" question, look at the mean and the median. If they agree, the distribution is roughly symmetric. If they diverge, dig in: outliers or skew may be hiding important information.
.describe() — the multi-statistic summary
Get all the common statistics in one call:
On a DataFrame, describe does the same per numeric column.
On a non-numeric column, it returns count / unique / top / freq.
.agg() — multiple aggregations at once
.agg lets you compute several aggregations in one go, getting
back a Series or DataFrame of results.
You can also pass a dict to apply different aggregations to different columns:
Aggregating filtered subsets
You can combine filters with aggregations to answer questions like "average salary of engineers."
This works, but if you find yourself doing it for every
department, you want groupby — coming up next chapter.
value_counts — a one-shot category breakdown
For a categorical column, .value_counts() gives you both the
distinct values and their counts.
normalize=True converts counts to proportions — the fastest
way to compute category shares.
Counting nulls and uniques
Two cheap and useful one-liners:
Check your understanding
If a column has values [10, 20, NaN, 40, NaN], what does .mean() return by default?
NaN
14.0
23.33 (= (10+20+40) / 3)
70.0
A salary column has values [50, 55, 60, 60, 65, 70, 75, 500]. Which statistic is least affected by the 500 outlier?
The mean
The sum
The median
The max
What does s.value_counts(normalize=True) produce?
The counts of unique values
A list of unique values
The proportions (fractions of total) of each unique value
A sorted Series