Shape and Outliers

Reading the shape of a distribution — skewness, modality, and tail heaviness — and a disciplined way to find outliers and decide what to actually do with them.

Center and spread compress a distribution into two numbers, but they can't tell you what it looks like. Is it a tidy hill? A long tail dragging to the right? Two separate humps that got mixed together? Does it throw out the occasional wild value? Those questions are about shape, and shape decides which summary statistics are meaningful. This page reads shape three ways (skewness, modality, and tail heaviness) and then tackles one of the most misunderstood topics in applied statistics: outliers, and what to actually do about them.

The headline you should carry in: an outlier is a question, not a verdict. "This point is far from the others" tells you to look, not to delete.

Skewness: which way the tail leans

A distribution is symmetric if its left and right halves mirror each other. It's skewed when one tail is stretched longer than the other. The naming trips people up, so anchor it firmly: skew is named for the direction of the long tail, not where the bulk of the data sits.

Right-skewed (positive skew): a long tail toward high values. The bulk is on the left, a few large values stretch right. The mean is pulled above the median. This is the default for money, durations, counts, and sizes.
Left-skewed (negative skew): a long tail toward low values. The bulk is on the right, a few small values stretch left. The mean is pulled below the median. Think exam scores with a ceiling, or age at death.

Misconception: 'right-skewed' means the peak is on the right

It's the opposite. Right-skewed means the tail points right, so the peak sits on the left. A rule of thumb that holds for most unimodal data: the skew points the way the tail (and usually the mean) gets dragged. Mean above median → likely right skew; mean below median → likely left skew.

scipy.stats.skew puts a number on this. Zero is symmetric, positive is right-skewed, negative is left-skewed.
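A minimal sketch of that measurement, using three synthetic samples invented here for illustration (the sample names, sizes, and seed are assumptions, not data from this page). It prints the skewness coefficient next to a mean − median column:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Three synthetic samples (illustrative only)
samples = {
    "symmetric": rng.normal(loc=50, scale=10, size=10_000),
    "right-skewed": rng.exponential(scale=10, size=10_000),
    "left-skewed": 100 - rng.exponential(scale=10, size=10_000),
}

print(f"{'sample':<14}{'skew':>8}{'mean - median':>16}")
for name, x in samples.items():
    skew = stats.skew(x)                 # 0 = symmetric, >0 = right, <0 = left
    gap = np.mean(x) - np.median(x)      # the cheap field test
    print(f"{name:<14}{skew:>8.2f}{gap:>16.2f}")
```

For the symmetric sample both numbers hover near zero, so their signs there are just noise.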

The sign of mean − median typically matches the sign of the skewness coefficient. That's the cheap field test from Measures of Center: you don't need to compute a skewness coefficient to suspect skew; just compare the mean and median.

Modality: how many peaks

Modality counts the peaks (modes) in a distribution. Unimodal data has one hump. Bimodal data has two, and multimodal more — and a second peak is almost always a story: two populations got mixed into one column. Heights of a mixed-gender group, response times split between cache-hit and cache-miss, purchase amounts from two customer segments. When you see two peaks, the right move is usually to split the groups, because no single center or spread honestly describes a two-humped distribution.
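To make the mixed-populations idea concrete, here is a small sketch with a hypothetical response-time column built from two synthetic groups (the column names `source` and `ms`, the group sizes, and the parameters are all assumptions for illustration). It checks how much data sits near the overall mean, then summarizes each group separately:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical response times in ms: fast cache hits mixed with slow cache misses
df = pd.DataFrame({
    "source": ["hit"] * 700 + ["miss"] * 300,
    "ms": np.concatenate([
        rng.normal(loc=20, scale=3, size=700),
        rng.normal(loc=200, scale=20, size=300),
    ]),
})

overall_mean = df["ms"].mean()
# Share of rows within 20 ms of the overall mean, i.e. in the valley between the humps
near_mean = ((df["ms"] - overall_mean).abs() < 20).mean()

print(f"overall mean: {overall_mean:.1f} ms; rows within 20 ms of it: {near_mean:.1%}")

# Splitting by group gives summaries that describe actual rows
per_group = df.groupby("source")["ms"].agg(["count", "mean", "std"])
print(per_group.round(1))
```

In real data the grouping column is often not handed to you; a histogram is usually what reveals that a split is needed.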

A single center can describe nobody

The mean of a bimodal column tends to fall in the sparse valley between the two humps, where almost no actual observation sits, much like the "average of a teacher and a billionaire." Bimodality is the clearest case where summarizing before plotting misleads you. When a mean lands where the data is sparse, suspect mixed groups.

Tail heaviness: kurtosis (lightly)

The last shape question is how heavy the tails are — how often you get values far from the center. Kurtosis measures this. You rarely need the exact number; what matters is the intuition: heavy-tailed distributions produce extreme values far more often than a normal distribution would, so "rare" events aren't actually that rare.

scipy.stats.kurtosis reports excess kurtosis by default (it subtracts 3, so a normal distribution reads ~0). Positive means heavier tails than normal; negative means lighter.
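A short sketch comparing three synthetic samples (the choice of distributions, sample size, and seed are assumptions for illustration). It shows the default excess value, the raw value obtained with `fisher=False`, and the share of points more than four standard deviations from the mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 100_000

normal = rng.normal(size=n)
heavy = rng.standard_t(df=5, size=n)   # Student's t: heavier tails than normal
light = rng.uniform(-1, 1, size=n)     # uniform: lighter tails than normal

for name, x in [("normal", normal), ("t(df=5)", heavy), ("uniform", light)]:
    excess = stats.kurtosis(x)             # default fisher=True: excess kurtosis
    raw = stats.kurtosis(x, fisher=False)  # Pearson definition, normal reads ~3
    z = (x - x.mean()) / x.std()
    beyond_4sd = np.mean(np.abs(z) > 4)    # share of points more than 4 SDs out
    print(f"{name:<10} excess={excess:6.2f}  raw={raw:6.2f}  beyond 4 SD={beyond_4sd:.5f}")
```

The last column is the practical reading: in the heavy-tailed sample, points beyond four standard deviations should show up far more often than in the normal sample.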

Why heavy tails matter in practice

Financial returns, insurance losses, network delays, and file sizes often have heavy tails. If you assume normal-sized wiggles, you'll be blindsided by "impossible" extremes that are actually routine for the distribution. Heavy tails are also why a single robust spread (IQR) can beat the standard deviation — one fat-tailed draw can dominate the SD.

Outliers: what they are (and aren't)

An outlier is a value that sits far from the rest of the data. That is all the definition says: "far away." It does not say "wrong," "error," or "delete me." Outliers come in three flavors, and each calls for a different response:

Errors: a typo, a unit mix-up (kg vs lb), or a sensor glitch. These should be fixed or removed, but only once you've confirmed they're errors.
Sentinels / structural junk: placeholder values like 0, -999, 9999, or 1970-01-01 standing in for "unknown" or "missing." Handle them as missing data, not as real measurements.
Genuine extreme values: a real whale customer, a real fraud case, a real record-breaking day. These are often the most important rows in the dataset — deleting them throws away the signal you're being paid to find.
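As one way to handle the sentinel case, the sketch below converts a placeholder code to a proper missing value before computing anything. The readings and the choice of -999 as the missing code are hypothetical; in practice the list of sentinels should come from the data dictionary or whoever produced the file.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings in °C; -999 is assumed to be a "missing" code
temps = pd.Series([12.5, 18.0, -999, 21.3, 7.8, -999, 30.1, -4.2])

SENTINELS = [-999]  # assumption: confirm against the data dictionary first
cleaned = temps.replace(SENTINELS, np.nan)

print(f"mean with sentinels left in:  {temps.mean():.2f}")
print(f"mean with sentinels as NaN:   {cleaned.mean():.2f}")  # pandas skips NaN by default
print(f"values now marked missing:    {cleaned.isna().sum()}")
```

Marking the values as missing, rather than dropping the rows, keeps the other columns of those rows available and makes the missingness itself countable.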

Misconception: outliers are errors you should delete

This is among the most damaging habits in data cleaning. Auto-deleting outliers hides fraud, removes your best customers, erases the rare events you most need to model, and quietly biases every downstream estimate. The correct default is investigate, then decide. Never delete on sight.

Two rules for flagging outliers

Flagging is mechanical; deciding is human. Two standard rules flag candidates:

The 1.5×IQR rule (box-plot fences). Compute Q1, Q3, and IQR = Q3 − Q1. Anything below Q1 − 1.5×IQR or above Q3 + 1.5×IQR is flagged. Because it's built on quartiles, it's robust: a handful of extreme values barely moves the fences, so it stays usable on skewed or contaminated data.

The z-score rule. Compute each point's z = (x − mean) / std and flag anything with |z| > 3 (sometimes 2). It asks "how many standard deviations from the mean?" But it's built on the mean and standard deviation, which the outliers themselves inflate — so on skewed or contaminated data it can hide the very points it's meant to catch.

The two rules often disagree, and seeing how is genuinely clarifying.
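A sketch of that disagreement on synthetic right-skewed data (a lognormal sample standing in for hypothetical purchase amounts; the parameters and seed are assumptions). Both rules are applied to the same Series and the flag counts are compared:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical right-skewed purchase amounts (illustrative only)
amounts = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5_000))

# Rule 1: 1.5 x IQR fences, built on quartiles
q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_flag = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

# Rule 2: |z| > 3, built on the mean and standard deviation
z = (amounts - amounts.mean()) / amounts.std()
z_flag = z.abs() > 3

print(f"upper IQR fence:       {q3 + 1.5 * iqr:10.1f}")
print(f"mean + 3 * std cutoff: {amounts.mean() + 3 * amounts.std():10.1f}")
print(f"flagged by IQR rule:   {iqr_flag.sum()}")
print(f"flagged by z rule:     {z_flag.sum()}")
print(f"flagged by z only:     {(z_flag & ~iqr_flag).sum()}")
```

On data like this, the z-score cutoff lands well above the IQR fence because the long tail inflates both the mean and the standard deviation, so the z rule flags only the most extreme slice of what the IQR rule flags.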

A box plot draws those IQR fences for you — the "whiskers" extend to the last point inside the fence, and anything beyond is plotted as an individual outlier dot.

From flag to decision

Once a point is flagged, the real work begins: why is it out there?

Practical outlier hygiene

1. Always look at the flagged rows individually: pull them up and read them.
2. Diagnose the cause before acting.
3. If they're real, keep them and switch to robust statistics (median, IQR, MAD) rather than deleting.
4. If you must remove or cap, report results both with and without so the decision is transparent.
5. Document every exclusion. "I dropped 3 rows" with no reason is a red flag in any analysis.
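The "read the flagged rows" and "report with and without" habits can be sketched as follows. The amounts are made up for illustration, with two large values standing in for bulk orders assumed to be confirmed as real, and `summarize` is a hypothetical helper, not a library function:

```python
import pandas as pd
from scipy import stats

# Hypothetical purchase amounts; the two large values play the role of real bulk orders
amounts = pd.Series(
    [35, 42, 58, 61, 75, 80, 95, 110, 140, 180, 48_000, 52_000], dtype=float
)

q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1
flagged = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

def summarize(x):
    """Classical and robust summaries side by side (hypothetical helper)."""
    return {
        "n": len(x),
        "mean": x.mean(),
        "median": x.median(),
        "iqr": x.quantile(0.75) - x.quantile(0.25),
        "mad": stats.median_abs_deviation(x.to_numpy()),
    }

print(amounts[flagged])  # step 1: read the flagged rows themselves

report = pd.DataFrame({
    "all rows": summarize(amounts),
    "without flagged": summarize(amounts[~flagged]),
})
print(report.round(1))
```

The mean changes enormously between the two columns, while the median moves only a little, which is the argument for leading with robust summaries when the extremes are genuine.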

Question (select one)

A column of purchase amounts has a few values near $50,000 while most are under $200. Investigation shows these are real bulk orders from genuine wholesale customers. What's the appropriate action?

Delete them so they don't distort the average purchase amount

Keep them, and report robust summaries (median, IQR) alongside the mean so the bulk orders don't dominate the headline number

Replace them with the mean of the other values

Assume they are data-entry errors because they are so large

Putting it together: an outlier flagger

A reusable IQR-based flagger is something you'll write constantly in EDA. The pattern: compute the fences, return which values fall outside, and how many.

Implement the standard 1.5×IQR outlier rule on the provided pandas Series readings.

Compute (using the 25th and 75th percentiles):

q1, q3 — the lower and upper quartiles
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

Then produce:

outliers — a Python list of the values in readings that fall below lower_fence or above upper_fence (strictly outside the fences).
n_outliers — an int, the count of those outliers.

Use the provided readings Series.
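One possible solution sketch follows. The `readings` values here are a hypothetical stand-in, since the provided Series is not shown on this page; the quartiles use the default linear interpolation of `Series.quantile`, which is an assumption about what the exercise expects.

```python
import pandas as pd

# Hypothetical stand-in for the provided `readings` Series
readings = pd.Series([10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 25.0, 10.3, 9.7, -3.0])

# Quartiles and fences
q1 = readings.quantile(0.25)
q3 = readings.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Strictly outside the fences; values exactly on a fence are not flagged
mask = (readings < lower_fence) | (readings > upper_fence)
outliers = readings[mask].tolist()
n_outliers = len(outliers)

print(f"fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print(f"outliers: {outliers} (n = {n_outliers})")
```

Keeping the boolean `mask` around is handy in practice: the same mask can pull up the full flagged rows from a DataFrame for inspection.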

Check your understanding

Question (select one)

A distribution has a long tail stretching toward high values and most of its mass on the left. How is it described, and how do the mean and median compare?

Left-skewed, with the mean below the median

Right-skewed, with the mean above the median

Symmetric, since the peak is on one side

Bimodal, because the mass and the tail are separated

Question (select one)

On strongly right-skewed data, the z-score rule (|z| > 3) flags fewer points than the 1.5×IQR rule. Why?

The IQR rule is simply more sensitive by design and always flags more

The extreme high values inflate the mean and standard deviation that the z-score uses, pushing the cutoff (mean + 3 × std) so far up that those very points fall short of it

The z-score rule only works on integer data

The IQR rule ignores the median, making it flag more points

Question (select one)

You find three values of exactly -999 in a temperature column that otherwise ranges from −20 to 45. What is the most likely explanation and correct handling?

They are genuine record-cold readings worth keeping

They are almost certainly a sentinel value for "missing," and should be treated as missing data rather than as real temperatures

They are random outliers to flag with the z-score rule and keep

They prove the sensor is broken and the whole column should be discarded

Question (select one)

Which statement about kurtosis is accurate?

High (positive excess) kurtosis means the data is strongly skewed

It measures how many peaks (modes) a distribution has

Positive excess kurtosis indicates heavier tails than a normal distribution — extreme values occur more often than a normal model predicts

A normal distribution has an excess kurtosis of 3

Key takeaways

Shape decides which statistics are valid; center and spread alone can't reveal it — plot the distribution.
Skew is named for the long tail: right-skew → mean above median; left-skew → mean below median.
A second peak (bimodality) usually means two groups mixed together — split them; no single center fits.
Kurtosis is tail heaviness: heavy tails make "rare" extremes routine.
Outliers are questions, not verdicts. Flag with the robust 1.5×IQR rule (or z-scores, cautiously); then investigate and decide — fix errors, treat sentinels as missing, and keep genuine extremes while switching to robust statistics. Never auto-delete.
