Dataslope

Statistical Summaries

describe, mean, median, std, quantiles, value_counts, and correlations — the everyday vocabulary of summary statistics.

Summary statistics compress a big, messy distribution into a few interpretable numbers. They're the first vocabulary every analyst needs.

The cheat sheet

Statistic          Meaning
count              How many non-missing values
mean               Arithmetic average
median             Middle value (50th percentile)
std                Standard deviation: how spread out the values are
min / max          Smallest / largest value
25% / 50% / 75%    Quartiles
mode               Most common value
nunique            How many distinct values

describe — the one-shot summary

describe() reports count, mean, std, min, the quartiles, and max for every numeric column in one call. Use .T (transpose) to turn columns into rows, which is much easier to read when you have many columns.
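
As a minimal sketch of the call described above, the snippet below builds a tiny DataFrame and summarizes it. The column names and values are made up purely for illustration; they are not the lesson's original dataset.

```python
import pandas as pd

# Hypothetical data, invented for this example.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "salary": [40_000, 52_000, 58_000, 45_000, 60_000],
    "department": ["ops", "eng", "eng", "ops", "sales"],
})

# With mixed column types, describe() summarizes only the numeric ones.
summary = df.describe()
print(summary)

# Transposed view: one row per column, one column per statistic.
print(summary.T)
```

Note that the text column "department" is skipped here, which is why non-numeric columns need a separate call.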

For non-numeric columns:

You get count, unique, top (most common), and freq.
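
One way to see those four rows is to call describe() on a selection that contains only text columns. This is a sketch on invented data; the lesson's original code may have selected the columns differently.

```python
import pandas as pd

# Hypothetical customer data, invented for this example.
df = pd.DataFrame({
    "tier": ["free", "pro", "free", "enterprise", "free"],
    "country": ["US", "DE", "US", "US", "FR"],
    "seats": [1, 5, 1, 200, 2],
})

# Selecting only the text columns makes describe() switch to the
# categorical summary: count, unique, top, freq.
text_summary = df[["tier", "country"]].describe()
print(text_summary)
```

Here "top" is the most frequent value in each column and "freq" is how many times it appears.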

Mean vs Median — the classic trap

The mean salary is $165K — almost nobody earns that. The median is $52K — much more representative. For skewed distributions, prefer median. Income, prices, durations, and file sizes are almost always skewed.
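
To make the effect concrete, here is a small sketch with invented salaries, chosen so that they reproduce the two figures quoted above. It is not the lesson's original data: eight ordinary salaries plus one very large one.

```python
import pandas as pd

# Hypothetical salaries: eight typical values and one extreme earner.
salaries = pd.Series([
    40_000, 45_000, 48_000, 50_000, 52_000,
    55_000, 58_000, 60_000, 1_077_000,
])

mean_salary = salaries.mean()      # pulled upward by the single extreme value
median_salary = salaries.median()  # the middle value, unaffected by how large the top one is

print(f"mean:   ${mean_salary:,.0f}")    # mean:   $165,000
print(f"median: ${median_salary:,.0f}")  # median: $52,000
```

Eight of the nine people earn well below the mean, which is why the median is the better "typical" figure here.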

Quantiles — beyond the median

Quantiles are great for things like SLAs ("95% of our requests finish in under X ms") or pay-band analyses.
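
A quick sketch of the SLA case, using made-up request latencies. Series.quantile takes a fraction between 0 and 1 (or a list of them) and, by default, interpolates linearly between neighboring data points.

```python
import pandas as pd

# Hypothetical request latencies in milliseconds, with one slow request.
latency_ms = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 500])

p50 = latency_ms.quantile(0.50)  # the median
p95 = latency_ms.quantile(0.95)  # 95% of requests finish at or below this

# Several quantiles at once: returns a Series indexed by the fractions.
several = latency_ms.quantile([0.25, 0.50, 0.75, 0.95])

print(p50, p95)
print(several)
```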

Spread — std and IQR

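std (standard deviation) is roughly the typical distance of a value from the mean. IQR (interquartile range = Q3 − Q1) is the width of the middle 50% of the data. Because it ignores the tails, IQR is robust to outliers; std is computed from every value, so a single extreme point can inflate it.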
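
The difference in robustness is easy to demonstrate. The sketch below uses two invented Series that are identical except for the last value; the helper name spread is hypothetical, introduced only for this example.

```python
import pandas as pd

# Two hypothetical samples that differ only in the final value.
values = pd.Series([10, 12, 13, 15, 16, 18, 20])
with_outlier = pd.Series([10, 12, 13, 15, 16, 18, 200])

def spread(s: pd.Series) -> dict:
    """Return the sample standard deviation and the interquartile range."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {"std": s.std(), "iqr": q3 - q1}

base = spread(values)
extreme = spread(with_outlier)

print(base)     # std is about 3.5, iqr is 4.5
print(extreme)  # std jumps to about 70, iqr is still 4.5
```

Note that pandas computes std with ddof=1 (the sample standard deviation) by default.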

value_counts — the categorical describe

normalize=True gives proportions instead of counts — perfect for "what share of our customers are in each tier?".
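
A short sketch of both forms on an invented list of customer tiers (the values are illustrative only):

```python
import pandas as pd

# Hypothetical customer tiers.
tiers = pd.Series(["free", "pro", "free", "enterprise",
                   "free", "pro", "free", "free"])

counts = tiers.value_counts()                # raw counts, most common first
shares = tiers.value_counts(normalize=True)  # same order, as fractions of the total

print(counts)
print(shares)
```

The shares always sum to 1, so multiplying by 100 gives percentages directly.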

Correlation

The correlation between two columns ranges from -1 to +1:

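Value    Meaning
+1.0     Perfect positive: the two columns rise together in an exact straight-line relationship
+0.7     Strong positive: they often move together
0        No linear relationship
-0.7     Strong negative: when one goes up, the other tends to go down
-1.0     Perfect negative: one rises exactly as the other falls, in a straight line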
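
To connect the table to code, here is a sketch on made-up data. DataFrame.corr() returns the pairwise matrix (Pearson by default), and Series.corr() returns a single coefficient for two columns. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical data: gaming hours are constructed as 7 - study hours,
# so those two columns are perfectly negatively correlated.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 64, 70, 74],
    "hours_gaming": [6, 5, 4, 3, 2, 1],
})

corr_matrix = df.corr()                            # all pairs at once
r = df["hours_studied"].corr(df["exam_score"])     # one pair

print(corr_matrix.round(2))
print(round(r, 3))
```

The diagonal of the matrix is always 1.0, since every column is perfectly correlated with itself.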

Correlation ≠ causation

A high correlation does not mean A causes B. Both might be driven by a third variable, or it could be coincidence. Always investigate further before claiming a cause.

Grouped summaries

The natural extension — describe per group:
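
A minimal sketch of a grouped summary, assuming a DataFrame with a grouping column and a numeric column. The names department and salary and all the values are invented for this example.

```python
import pandas as pd

# Hypothetical salaries by department.
df = pd.DataFrame({
    "department": ["eng", "eng", "eng", "ops", "ops", "ops"],
    "salary": [90_000, 100_000, 110_000, 50_000, 55_000, 60_000],
})

# Full describe() output, one row per group.
per_group = df.groupby("department")["salary"].describe()
print(per_group)

# Or pick just the statistics you care about.
picked = df.groupby("department")["salary"].agg(["count", "mean", "median"])
print(picked)
```

Using agg with a list of names keeps the output narrow when the full describe() table is more than you need.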

Outlier flags via the 1.5×IQR rule

A common rule of thumb: any value below Q1 − 1.5×IQR or above Q3 + 1.5×IQR is a candidate outlier.
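
The rule translates into a few lines of pandas. This is a sketch on invented numbers, using the default linear-interpolation quantiles; other quantile definitions can shift the fences slightly.

```python
import pandas as pd

# Hypothetical measurements with one suspiciously large value.
values = pd.Series([12, 14, 15, 15, 16, 17, 18, 19, 21, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# The "fences": anything outside them is flagged as a candidate outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)  # boolean mask
outliers = values[is_outlier]

print(f"fences: [{lower}, {upper}]")  # fences: [9.375, 24.375]
print(outliers.tolist())              # [95]
```

Keeping the boolean mask as a separate flag, rather than dropping rows right away, makes it easy to review the flagged values first.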

This is a heuristic, not a verdict — you still have to decide whether the outlier is wrong data, a real edge case, or the most important customer on the list.

Mini challenge

Compute a robust summary

Given a Series values (provided), build a Python dict summary containing exactly these keys with the indicated values:

  • "count": non-missing count (int)
  • "mean": arithmetic mean (float)
  • "median": median (float)
  • "p95": 95th percentile (float)
  • "iqr": Q3 minus Q1 (float)
  • "n_outliers": number of values outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR] (int)

Check your understanding

Question (select one)

Reporting income data for a country, which is usually the more representative single number?

mean

median — incomes are right-skewed and median is robust to the few very high earners

mode

std

Question (select one)

What does value_counts(normalize=True) return?

Sorted values

Z-scores

Proportions (relative frequencies) instead of raw counts — perfect for "what share of rows fall into each category"

A heatmap

Question (select one)

Two columns have a correlation of -0.95. The correct interpretation is:

One causes the other

They are unrelated

They have a very strong inverse linear relationship — when one goes up, the other tends to go down. (No claim about cause.)

It is a bug

Question (select one)

The 1.5×IQR rule flags a few "outliers" in your data. What is the correct next step?

Delete them immediately

Ignore them

Investigate — they might be data errors, sentinel values, or genuinely important edge cases. The decision to drop, cap, or keep depends on context.

Replace them with the mean
