Statistical Summaries

describe, mean, median, std, quantiles, value_counts, and correlations — the everyday vocabulary of summary statistics.

Summary statistics compress a big, messy distribution into a few interpretable numbers. They're the first vocabulary every analyst needs.

The cheat sheet

Statistic	Meaning
count	How many non-missing values
mean	Arithmetic average
median	Middle value (50th percentile)
std	Standard deviation — how spread out
min / max	Smallest / largest value
25% / 50% / 75%	Quartiles
mode	Most common value
nunique	How many distinct values

describe — the one-shot summary


describe() reports count, mean, std, min, the quartiles, and max for every numeric column in one call. Use .T (transpose) to turn each column into a row, which is much easier to read when you have many columns.
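As a rough sketch of what this looks like in practice, the snippet below runs describe() on a small DataFrame. The data and the column names (age, salary, department) are invented for illustration and are not the lesson's original dataset.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "salary": [48_000, 52_000, 61_000, 58_000, 55_000],
    "department": ["ops", "ops", "eng", "eng", "sales"],
})

# By default, describe() summarizes only the numeric columns,
# so "department" is left out of the result.
summary = df.describe()
print(summary)

# Transposed view: one row per column, one column per statistic.
print(summary.T)
```

The transposed form tends to pay off once a table has more columns than fit comfortably across the screen.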

For non-numeric columns:


You get count, unique (the number of distinct values), top (the most common value), and freq (how many times top occurs).
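One way to get the non-numeric summary is to exclude numeric dtypes when calling describe(). This is a sketch on invented data; the lesson's own cell may have used a different argument, such as include="object".

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "department": ["ops", "ops", "eng", "eng", "ops", "sales"],
    "salary": [48_000, 52_000, 61_000, 58_000, 50_000, 55_000],
})

# Excluding numeric dtypes leaves only the text column,
# which is summarized with count, unique, top, and freq.
cat_summary = df.describe(exclude=["number"])
print(cat_summary)
```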

Mean vs Median — the classic trap

In a salary example where a few very high earners pull the mean up to $165K, almost nobody actually earns that much. The median, $52K, is far more representative. For skewed distributions, prefer the median. Income, prices, durations, and file sizes are almost always skewed.
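The gap between the two numbers is easy to reproduce. The salaries below are hypothetical values chosen so that the mean and median match the figures quoted above; they are not the lesson's original data.

```python
import pandas as pd

# Hypothetical salaries: eight ordinary values and one very high earner.
salaries = pd.Series([
    40_000, 45_000, 48_000, 50_000, 52_000,
    55_000, 58_000, 60_000, 1_077_000,
])

mean_salary = salaries.mean()      # pulled upward by the single large value
median_salary = salaries.median()  # middle value, unaffected by the extreme

# How many people actually earn at least the mean?
n_at_or_above_mean = int((salaries >= mean_salary).sum())

print(f"mean:   {mean_salary:,.0f}")
print(f"median: {median_salary:,.0f}")
print(f"earning at least the mean: {n_at_or_above_mean} of {len(salaries)}")
```

Only one of the nine people earns the mean or more, which is why the median describes the typical person better here.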

Quantiles — beyond the median

A quantile is the value below which a given share of the data falls; the median is the 0.5 quantile. Quantiles are well suited to SLAs ("95% of our requests finish in under X ms") and pay-band analyses.
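A minimal sketch of the SLA case, using Series.quantile on made-up request latencies. The variable names and numbers are illustrative assumptions. Note that pandas interpolates linearly between data points by default, so a quantile need not be one of the observed values.

```python
import pandas as pd

# Hypothetical request latencies in milliseconds, with one slow request.
latency_ms = pd.Series([12, 15, 14, 18, 22, 16, 13, 250, 17, 19])

p50 = latency_ms.quantile(0.5)    # the median
p95 = latency_ms.quantile(0.95)   # 95% of requests finish at or below this

# Passing a list returns a Series indexed by the requested quantiles.
several = latency_ms.quantile([0.25, 0.5, 0.75, 0.95])

print(f"p50 = {p50} ms, p95 = {p95} ms")
print(several)
```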

Spread — std and IQR

std (standard deviation) is roughly the typical distance of a value from the mean. IQR (interquartile range = Q3 − Q1) is the width of the middle 50% of the data. IQR is more robust to outliers; std is more sensitive, because a single extreme value can inflate it.
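To make the robustness claim concrete, this sketch compares std and IQR on two invented Series that differ only in their largest value. The iqr helper is a hypothetical convenience function, not a pandas built-in.

```python
import pandas as pd


def iqr(s: pd.Series) -> float:
    """Interquartile range: Q3 minus Q1."""
    return s.quantile(0.75) - s.quantile(0.25)


# Hypothetical data: identical except for the last value.
clean = pd.Series([10, 12, 13, 14, 15, 16, 18])
with_outlier = pd.Series([10, 12, 13, 14, 15, 16, 180])

# Series.std() uses the sample standard deviation (ddof=1) by default.
print("std:", clean.std(), "vs", with_outlier.std())

# The quartiles do not move, so the IQR is the same for both.
print("IQR:", iqr(clean), "vs", iqr(with_outlier))
```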

value_counts — the categorical describe


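value_counts() counts how often each distinct value occurs, sorted from most to least frequent. normalize=True gives proportions instead of counts, which is perfect for "what share of our customers are in each tier?".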
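A short sketch of both forms, on a hypothetical Series of customer tiers (the tier names are invented for illustration):

```python
import pandas as pd

# Hypothetical customer tiers.
tiers = pd.Series(
    ["free", "free", "pro", "free", "enterprise", "pro", "free", "free"]
)

counts = tiers.value_counts()                # raw counts, most frequent first
shares = tiers.value_counts(normalize=True)  # proportions that sum to 1

print(counts)
print(shares)
```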

Correlation


The correlation between two columns ranges from -1 to +1:

Value	Meaning
+1.0	Perfect positive: the points fall exactly on an upward-sloping line
+0.7	Strong positive — they often move together
0	No linear relationship
-0.7	Strong negative — one goes up, the other goes down
-1.0	Perfect negative
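As an illustration, the sketch below computes a full correlation matrix with DataFrame.corr() and a single pairwise value with Series.corr(). Both use the Pearson coefficient by default. The columns and numbers are invented for the example.

```python
import pandas as pd

# Hypothetical data: scores rise with study time, errors fall.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 64, 70, 74],
    "errors_made": [12, 10, 9, 7, 4, 3],
})

# Pairwise correlations between all numeric columns.
corr_matrix = df.corr()
print(corr_matrix.round(2))

# Correlation between two specific columns.
r_score = df["hours_studied"].corr(df["exam_score"])
r_errors = df["hours_studied"].corr(df["errors_made"])
print(f"hours vs score:  {r_score:.2f}")
print(f"hours vs errors: {r_errors:.2f}")
```

If a DataFrame also contains text columns, recent pandas versions expect numeric_only=True to be passed to corr().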

Correlation ≠ causation

A high correlation does not mean A causes B. Both might be driven by a third variable, or it could be coincidence. Always investigate further before claiming a cause.

Grouped summaries

The natural extension — describe per group:
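A minimal sketch, assuming a DataFrame with a grouping column and a numeric column. The department and salary names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "department": ["eng", "eng", "eng", "ops", "ops", "ops"],
    "salary": [90_000, 100_000, 110_000, 50_000, 55_000, 60_000],
})

# Full describe() output, one row per department.
per_group = df.groupby("department")["salary"].describe()
print(per_group)

# Or pick only the statistics you care about.
custom = df.groupby("department")["salary"].agg(["count", "mean", "median"])
print(custom)
```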


Outlier flags via the 1.5×IQR rule

A common rule of thumb: anything below Q1 − 1.5×IQR or above Q3 + 1.5×IQR is a candidate outlier.

This is a heuristic, not a verdict — you still have to decide whether the outlier is wrong data, a real edge case, or the most important customer on the list.
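The rule can be sketched in a few lines. The Series name amounts and its values are hypothetical; the result is a set of candidates to investigate, not rows to delete.

```python
import pandas as pd

# Hypothetical order amounts with one unusually large value.
amounts = pd.Series([10, 12, 13, 14, 15, 16, 18, 95])

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 IQRs beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask: True where a value falls outside the fences.
is_outlier = (amounts < lower) | (amounts > upper)
candidates = amounts[is_outlier]

print(f"fences: [{lower}, {upper}]")
print(candidates)
```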

Mini challenge

Given a Series values (provided), build a Python dict summary containing exactly these keys with the indicated values:

"count": non-missing count (int)
"mean": arithmetic mean (float)
"median": median (float)
"p95": 95th percentile (float)
"iqr": Q3 minus Q1 (float)
"n_outliers": number of values outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR] (int)
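One possible solution, for checking your own attempt. Because the real values Series is supplied by the exercise, the one below is a hypothetical stand-in that includes a missing value and one extreme value.

```python
import pandas as pd

# Hypothetical stand-in for the Series the exercise provides.
values = pd.Series([4.0, 7.0, 5.0, None, 6.0, 8.0, 5.0, 40.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparisons against NaN are False, so missing values are never counted.
outside = (values < lower) | (values > upper)

summary = {
    "count": int(values.count()),          # count() skips missing values
    "mean": float(values.mean()),
    "median": float(values.median()),
    "p95": float(values.quantile(0.95)),
    "iqr": float(iqr),
    "n_outliers": int(outside.sum()),
}
print(summary)
```

The int() and float() conversions turn NumPy scalars into plain Python types, which matches the types the challenge asks for.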

Check your understanding

Question (select one)

Reporting income data for a country, which is usually the more representative single number?

mean

median — incomes are right-skewed and median is robust to the few very high earners

mode

std

Question (select one)

What does value_counts(normalize=True) return?

Sorted values

Z-scores

Proportions (relative frequencies) instead of raw counts — perfect for "what share of rows fall into each category"

A heatmap

Question (select one)

Two columns have a correlation of -0.95. The correct interpretation is:

One causes the other

They are unrelated

They have a very strong inverse linear relationship — when one goes up, the other tends to go down. (No claim about cause.)

It is a bug

Question (select one)

The 1.5×IQR rule flags a few "outliers" in your data. What is the correct next step?

Delete them immediately

Ignore them

Investigate — they might be data errors, sentinel values, or genuinely important edge cases. The decision to drop, cap, or keep depends on context.

Replace them with the mean
