Statistical Summaries
describe, mean, median, std, quantiles, value_counts, and correlations — the everyday vocabulary of summary statistics.
Summary statistics compress a big, messy distribution into a few interpretable numbers. They're the first vocabulary every analyst needs.
The cheat sheet
| Statistic | Meaning |
|---|---|
| count | How many non-missing values |
| mean | Arithmetic average |
| median | Middle value (50th percentile) |
| std | Standard deviation — how spread out |
| min / max | Smallest / largest value |
| 25% / 50% / 75% | Quartiles |
| mode | Most common value |
| nunique | How many distinct values |
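Each row of the cheat sheet corresponds to a pandas Series method. The sketch below uses a small made-up Series named orders (an assumption for illustration only) to show how the numbers might be computed one at a time.

```python
import pandas as pd

# Hypothetical data: five order values plus one missing entry.
orders = pd.Series([20.0, 35.0, 35.0, 50.0, None, 110.0])

stats = {
    "count": int(orders.count()),    # non-missing values only
    "mean": orders.mean(),           # NaN is skipped by default
    "median": orders.median(),
    "std": orders.std(),             # sample standard deviation (ddof=1)
    "min": orders.min(),
    "max": orders.max(),
    "mode": orders.mode().tolist(),  # mode() returns a Series; ties yield several values
    "nunique": orders.nunique(),     # distinct non-missing values
}
print(stats)
```

Note that the missing entry is ignored by every statistic here, which is why count is one less than the length of the Series.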
describe — the one-shot summary
describe() reports count, mean, std, min, the quartiles, and max for every numeric column in one call. Use .T (transpose) to put columns as rows, which is much easier to read when you have many columns.
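A minimal sketch of the call, assuming a small hypothetical DataFrame with two numeric columns and one text column (the column names are invented for illustration):

```python
import pandas as pd

# Hypothetical data; "tier" is text and is skipped by the default describe().
df = pd.DataFrame({
    "price": [9.5, 12.0, 15.0, 20.0, 48.0],
    "quantity": [1, 3, 2, 5, 4],
    "tier": ["basic", "basic", "pro", "pro", "basic"],
})

summary = df.describe()   # numeric columns only by default
print(summary)
print(summary.T)          # transposed: one row per column
```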
For non-numeric columns, describe() reports a different set of statistics: count, unique, top (the most common value), and freq (how often that value occurs).
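As an illustration, here is describe() on a hypothetical text Series named tiers (the data is made up):

```python
import pandas as pd

# Hypothetical categorical data.
tiers = pd.Series(["basic", "basic", "pro", "pro", "basic"], name="tier")

text_summary = tiers.describe()
print(text_summary)   # rows: count, unique, top, freq
```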
Mean vs Median — the classic trap
When a few very high salaries pull the mean up to $165K, almost nobody actually earns that; a median of $52K is much more representative of a typical person. For skewed distributions, prefer the median. Income, prices, durations, and file sizes are almost always skewed.
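The sketch below reproduces those two figures with a hypothetical list of nine salaries. The values are invented purely so that the mean and median come out at $165K and $52K; they are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical salaries: eight ordinary earners and one very high earner.
salaries = pd.Series([
    40_000, 45_000, 48_000, 50_000, 52_000,
    55_000, 58_000, 60_000, 1_077_000,
])

mean_salary = salaries.mean()
median_salary = salaries.median()
at_or_above_mean = int((salaries >= mean_salary).sum())

print(f"mean:   {mean_salary:,.0f}")
print(f"median: {median_salary:,.0f}")
print(f"people earning at least the mean: {at_or_above_mean}")
```

Only the single high earner sits at or above the mean, while the median lands on a salary someone in the group actually has.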
Quantiles — beyond the median
Quantiles are great for things like SLAs ("95% of our requests finish in under X ms") or pay-band analyses.
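A short sketch of an SLA-style check, assuming a hypothetical Series of request latencies in milliseconds. quantile() accepts a single fraction or a list of fractions and interpolates linearly between data points by default.

```python
import pandas as pd

# Hypothetical request latencies in milliseconds.
latency_ms = pd.Series([12, 15, 18, 20, 22, 25, 30, 45, 80, 250])

q = latency_ms.quantile([0.5, 0.9, 0.95])   # Series indexed by the fractions
p95 = latency_ms.quantile(0.95)             # a single float

print(q)
print(f"95% of requests finish in under {p95} ms")
```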
Spread — std and IQR
std (standard deviation) measures the typical distance from the mean. IQR (interquartile range = Q3 − Q1) is the width of the middle 50% of the data. IQR is robust to outliers, whereas a single extreme value can inflate std considerably.
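To make the difference concrete, the sketch below compares two hypothetical Series that differ only in their largest value. The helper iqr is not a pandas function; it is defined here for illustration.

```python
import pandas as pd

def iqr(s: pd.Series) -> float:
    """Interquartile range: Q3 minus Q1 (illustrative helper)."""
    return float(s.quantile(0.75) - s.quantile(0.25))

# Hypothetical data: identical except for the last value.
base = pd.Series([10, 12, 13, 14, 15, 16, 18])
with_outlier = pd.Series([10, 12, 13, 14, 15, 16, 180])

print("std:", base.std(), "->", with_outlier.std())
print("IQR:", iqr(base), "->", iqr(with_outlier))
```

In this example the IQR is unchanged by the extreme value, while std grows by more than an order of magnitude.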
value_counts — the categorical describe
normalize=True gives proportions instead of counts, which is perfect for "what share of our customers are in each tier?".
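A minimal sketch with a hypothetical Series of customer tiers, showing raw counts next to proportions:

```python
import pandas as pd

# Hypothetical customer tiers.
tiers = pd.Series(
    ["free", "free", "pro", "free", "enterprise", "pro", "free", "free"]
)

counts = tiers.value_counts()                 # sorted, most frequent first
shares = tiers.value_counts(normalize=True)   # proportions that sum to 1

print(counts)
print(shares)
```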
Correlation
The correlation between two columns ranges from -1 to +1:
| Value | Meaning |
|---|---|
| +1.0 | Perfect positive: the points lie exactly on an upward-sloping line |
| +0.7 | Strong positive: they tend to move together |
| 0 | No linear relationship (a non-linear one may still exist) |
| -0.7 | Strong negative: when one goes up, the other tends to go down |
| -1.0 | Perfect negative |
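The sketch below computes Pearson correlations (the pandas default) on a hypothetical DataFrame. The column names and values are invented so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical data: revenue is an exact increasing linear function of ad_spend,
# returns is an exact decreasing one, and tickets is only loosely related.
df = pd.DataFrame({
    "ad_spend": [1, 2, 3, 4, 5],
    "revenue": [3, 5, 7, 9, 11],
    "returns": [10, 8, 6, 4, 2],
    "tickets": [2, 1, 4, 3, 5],
})

corr_matrix = df.corr()                       # pairwise correlation matrix
r_tickets = df["ad_spend"].corr(df["tickets"])  # a single pair

print(corr_matrix.round(2))
print(r_tickets)
```

The matrix is symmetric with ones on the diagonal, so only the entries on one side of the diagonal need to be read.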
Correlation ≠ causation
A high correlation does not mean A causes B. Both might be driven by a third variable, or it could be coincidence. Always investigate further before claiming a cause.
Grouped summaries
The natural extension — describe per group:
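A sketch of per-group summaries, assuming a hypothetical DataFrame with a tier column to group by and a spend column to summarize:

```python
import pandas as pd

# Hypothetical customer spend by tier.
df = pd.DataFrame({
    "tier": ["basic", "basic", "basic", "pro", "pro", "pro"],
    "spend": [10.0, 20.0, 30.0, 100.0, 150.0, 200.0],
})

per_group = df.groupby("tier")["spend"].describe()   # full describe per tier
custom = df.groupby("tier")["spend"].agg(["count", "mean", "median"])

print(per_group)
print(custom)
```

Using agg with a list of names is one way to keep only the statistics you care about when the full describe() output is too wide.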
Outlier flags via the 1.5×IQR rule
A common rule of thumb: anything beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR is a candidate outlier.
This is a heuristic, not a verdict — you still have to decide whether the outlier is wrong data, a real edge case, or the most important customer on the list.
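The rule can be sketched in a few lines. The Series below is hypothetical, and the variable names (lower, upper, mask) are chosen for illustration.

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
values = pd.Series([10, 12, 13, 14, 15, 16, 180])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # fence below Q1
upper = q3 + 1.5 * iqr   # fence above Q3

mask = (values < lower) | (values > upper)   # True for candidate outliers
print(f"fences: [{lower}, {upper}]")
print(values[mask])
```

The mask only flags rows for review; nothing is dropped or modified.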
Mini challenge
Given a Series values (provided), build a Python dict summary containing exactly these keys with the indicated values:
- "count": non-missing count (int)
- "mean": arithmetic mean (float)
- "median": median (float)
- "p95": 95th percentile (float)
- "iqr": Q3 minus Q1 (float)
- "n_outliers": number of values outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR] (int)
Check your understanding
Reporting income data for a country, which is usually the more representative single number?
mean
median — incomes are right-skewed and median is robust to the few very high earners
mode
std
What does value_counts(normalize=True) return?
Sorted values
Z-scores
Proportions (relative frequencies) instead of raw counts — perfect for "what share of rows fall into each category"
A heatmap
Two columns have a correlation of -0.95. The correct interpretation is:
One causes the other
They are unrelated
They have a very strong inverse linear relationship — when one goes up, the other tends to go down. (No claim about cause.)
It is a bug
The 1.5×IQR rule flags a few "outliers" in your data. What is the correct next step?
Delete them immediately
Ignore them
Investigate — they might be data errors, sentinel values, or genuinely important edge cases. The decision to drop, cap, or keep depends on context.
Replace them with the mean