Dataslope logoDataslope

Aggregation Basics

Sum, mean, median, min, max — the simple reductions that turn many rows into one number, and the subtle choices that change what they mean.

An aggregation turns many values into one. The mean of a column, the count of rows, the maximum price, the sum of revenue — these are all aggregations.

This chapter covers single-Series aggregations. The next chapter introduces groupby, which is aggregations per group.

The reduction methods

Every numeric Series supports these:

MethodReturns the…
.sum()Sum
.mean()Arithmetic mean
.median()Middle value
.min()Minimum
.max()Maximum
.std()Standard deviation
.var()Variance
.quantile(q)Quantile (e.g. 0.25)
.count()Number of non-null values
.nunique()Number of distinct values
.first()First value
.last()Last value
Code Block
Python 3.13.2

How aggregations handle missing values

By default, NaN values are skipped. That sounds harmless but matters more than you think.

Code Block
Python 3.13.2

Compare with what you might naively expect: dividing by 5 (the length) would give 14.0. Pandas divides by 3 (the count of non-null values) and gives 23.33. This is almost always what you want — but be aware.

Mean vs median

The two most-cited "averages" mean different things.

  • Mean — sum / count. Sensitive to outliers.
  • Median — middle value when sorted. Robust to outliers.
Code Block
Python 3.13.2

A single outlier moved the mean from ~63 (where the typical person sits) to over 100. The median is unaffected. When reporting "typical" values to a non-technical audience, the median is often a better choice — and you should usually report both.

A useful habit

For any "typical X" question, look at the mean and the median. If they agree, the distribution is roughly symmetric. If they diverge, dig in: outliers or skew may be hiding important information.

.describe() — the multi-statistic summary

Get all the common statistics in one call:

Code Block
Python 3.13.2

On a DataFrame, describe does the same per numeric column. On a non-numeric column, it returns count / unique / top / freq.

.agg() — multiple aggregations at once

.agg lets you compute several aggregations in one go, getting back a Series or DataFrame of results.

Code Block
Python 3.13.2

You can also pass a dict to apply different aggregations to different columns:

Code Block
Python 3.13.2

Aggregating filtered subsets

You can combine filters with aggregations to answer questions like "average salary of engineers."

Code Block
Python 3.13.2

This works, but if you find yourself doing it for every department, you want groupby — coming up next chapter.

value_counts — a one-shot category breakdown

For a categorical column, .value_counts() gives you both the distinct values and their counts.

Code Block
Python 3.13.2

normalize=True converts counts to proportions — the fastest way to compute category shares.

Counting nulls and uniques

Two cheap and useful one-liners:

Code Block
Python 3.13.2
Initialization code (Python)read-only

Check your understanding

QuestionSelect one

If a column has values [10, 20, NaN, 40, NaN], what does .mean() return by default?

NaN

14.0

23.33 (= (10+20+40) / 3)

70.0

QuestionSelect one

A salary column has values [50, 55, 60, 60, 65, 70, 75, 500]. Which statistic is least affected by the 500 outlier?

The mean

The sum

The median

The max

QuestionSelect one

What does s.value_counts(normalize=True) produce?

The counts of unique values

A list of unique values

The proportions (fractions of total) of each unique value

A sorted Series

On this page