Shape and Outliers
Reading the shape of a distribution — skewness, modality, and tail heaviness — and a disciplined way to find outliers and decide what to actually do with them.
Center and spread compress a distribution into two numbers, but they can't tell you what it looks like. Is it a tidy hill? A long tail dragging to the right? Two separate humps that got mixed together? Does it throw out the occasional wild value? Those questions are about shape, and shape decides which statistics are even valid to use. This page reads shape three ways — skewness, modality, and tail heaviness — and then tackles the most misunderstood topic in applied statistics: outliers, and what to actually do about them.
The headline you should carry in: an outlier is a question, not a verdict. "This point is far from the others" tells you to look, not to delete.
Skewness: which way the tail leans
A distribution is symmetric if its left and right halves mirror each other. It's skewed when one tail is stretched longer than the other. The naming trips people up, so anchor it firmly: skew is named for the direction of the long tail, not where the bulk of the data sits.
- Right-skewed (positive skew): a long tail toward high values. The bulk is on the left, a few large values stretch right. The mean is pulled above the median. This is the default for money, durations, counts, and sizes.
- Left-skewed (negative skew): a long tail toward low values. The bulk is on the right, a few small values stretch left. The mean is pulled below the median. Think exam scores with a ceiling, or age at death.
Misconception: 'right-skewed' means the peak is on the right
It's the opposite. Right-skewed means the tail points right, so the peak sits on the left. A rule of thumb that holds for most unimodal data: the skew points the way the tail (and the mean) gets dragged. Mean above median → right skew; mean below median → left skew.
scipy.stats.skew puts a number on this. Zero is symmetric, positive
is right-skewed, negative is left-skewed.
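A minimal sketch of that check, using three small hand-made samples (the values are hypothetical, chosen only to make the direction of skew obvious). It prints the mean, the median, their difference, and the coefficient from scipy.stats.skew side by side.

```python
import numpy as np
from scipy.stats import skew

# Hypothetical samples: one symmetric, one with a long right tail,
# and its mirror image with a long left tail.
samples = {
    "symmetric": np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]),
    "right-skewed": np.array([1, 1, 2, 2, 3, 3, 4, 10, 30]),
    "left-skewed": np.array([1, 21, 27, 28, 28, 29, 29, 30, 30]),
}

rows = {}
print(f"{'sample':<14}{'mean':>8}{'median':>8}{'mean - median':>15}{'skew':>8}")
for name, x in samples.items():
    gap = x.mean() - np.median(x)  # sign of this gap is the cheap skew test
    s = skew(x)                    # sample skewness coefficient
    rows[name] = (gap, s)
    print(f"{name:<14}{x.mean():>8.2f}{np.median(x):>8.2f}{gap:>15.2f}{s:>8.2f}")
```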
Notice the mean − median column tracks the skew sign exactly. That's
the cheap field test from Measures of Center: you don't need to
compute a skewness coefficient to suspect skew — just compare the mean
and median.
Modality: how many peaks
Modality counts the peaks (modes) in a distribution. Unimodal data has one hump. Bimodal data has two, and multimodal more — and a second peak is almost always a story: two populations got mixed into one column. Heights of a mixed-gender group, response times split between cache-hit and cache-miss, purchase amounts from two customer segments. When you see two peaks, the right move is usually to split the groups, because no single center or spread honestly describes a two-humped distribution.
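As an illustration, here is a sketch of a bimodal column built from two simulated groups. The cache-hit and cache-miss response times, their centers, and their spreads are all assumptions made up for the example, not measurements.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical response times in ms: hits cluster near 20, misses near 200.
hits = rng.normal(loc=20, scale=3, size=500)
misses = rng.normal(loc=200, scale=15, size=500)
response_ms = np.concatenate([hits, misses])

overall_mean = response_ms.mean()

# Count how many observations sit within 30 ms of the overall mean.
near_mean = int(np.sum(np.abs(response_ms - overall_mean) < 30))

print(f"overall mean: {overall_mean:.1f} ms")
print(f"values within 30 ms of the mean: {near_mean} of {response_ms.size}")
print(f"hit mean: {hits.mean():.1f} ms, miss mean: {misses.mean():.1f} ms")

# A coarse text histogram shows the two humps and the valley between them.
counts, edges = np.histogram(response_ms, bins=10)
for count, left in zip(counts, edges[:-1]):
    print(f"{left:7.1f} | {'#' * (int(count) // 10)}")
```

Splitting the column back into its two groups gives two means that each describe real observations, which the pooled mean does not.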
A single center can describe nobody
The mean of that bimodal column falls in the empty valley between the two humps — no actual person is near it, exactly like the "average of a teacher and a billionaire." Bimodality is the clearest case where summarizing before plotting misleads you. When a mean lands where the data is sparse, suspect mixed groups.
Tail heaviness: kurtosis (lightly)
The last shape question is how heavy the tails are — how often you get values far from the center. Kurtosis measures this. You rarely need the exact number; what matters is the intuition: heavy-tailed distributions produce extreme values far more often than a normal distribution would, so "rare" events aren't actually that rare.
scipy.stats.kurtosis reports excess kurtosis by default (it
subtracts 3, so a normal distribution reads ~0). Positive means heavier
tails than normal; negative means lighter.
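A rough sketch of how this reads on simulated data. The three distributions, the seed, and the sample size are illustrative choices: the uniform stands in for light tails and the Laplace for heavy tails.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
n = 100_000

draws = {
    "uniform (light tails)": rng.uniform(-1, 1, size=n),
    "normal": rng.normal(size=n),
    "Laplace (heavy tails)": rng.laplace(size=n),
}

results = {}
for name, x in draws.items():
    excess = kurtosis(x)                  # excess kurtosis: normal reads about 0
    z = (x - x.mean()) / x.std()          # standardize so tails are comparable
    share_beyond_4sd = float(np.mean(np.abs(z) > 4))
    results[name] = (excess, share_beyond_4sd)
    print(f"{name:<24} excess kurtosis {excess:6.2f}   "
          f"share beyond 4 SD {share_beyond_4sd:.5f}")
```

The last column is the practical reading of kurtosis: the heavier the tails, the larger the share of draws that land more than four standard deviations from the center.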
Why heavy tails matter in practice
Financial returns, insurance losses, network delays, and file sizes often have heavy tails. If you assume normal-sized wiggles, you'll be blindsided by "impossible" extremes that are actually routine for the distribution. Heavy tails are also why a robust spread measure such as the IQR can beat the standard deviation: one fat-tailed draw can dominate the SD.
Outliers: what they are (and aren't)
An outlier is a value that sits far from the rest of the data. That is all the definition says — "far away." It does not say "wrong," "error," or "delete me." Outliers come in three flavors, and they call for opposite responses:
- Errors: a typo, a unit mix-up (kg vs lb), a sensor glitch, a -999 sentinel someone used for "missing." These should be fixed or removed, once you've confirmed they're errors.
- Sentinels / structural junk: placeholder values like 0, 9999, or 1970-01-01 standing in for "unknown." Handle them as missing data, not as real measurements.
- Genuine extreme values: a real whale customer, a real fraud case, a real record-breaking day. These are often the most important rows in the dataset, and deleting them throws away the signal you're being paid to find.
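Treating a sentinel as missing data might look like the sketch below. The temperature values and the choice of -999 as the placeholder are assumptions for illustration; in real data, confirm the sentinel against the data dictionary before masking it.

```python
import pandas as pd

# Hypothetical temperature readings in °C; -999 is assumed to mean "missing".
temps = pd.Series([12.5, 14.0, -999, 13.2, 15.1, -999, 11.8])

SENTINELS = [-999]  # assumption: verify against the source's documentation

# mask() replaces matching entries with NaN, so they count as missing
# rather than as real measurements.
cleaned = temps.mask(temps.isin(SENTINELS))

print(f"mean with sentinels left in: {temps.mean():.2f}")
print(f"mean with sentinels as NaN:  {cleaned.mean():.2f}")
print(f"missing values recorded:     {int(cleaned.isna().sum())}")
```

pandas skips NaN by default when computing the mean, so the cleaned summary reflects only real readings while the missing count stays visible.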
Misconception: outliers are errors you should delete
The single most damaging habit in data cleaning. Auto-deleting outliers hides fraud, removes your best customers, erases the rare events you most need to model, and quietly biases every downstream estimate. The correct default is investigate, then decide — never delete on sight.
Two rules for flagging outliers
Flagging is mechanical; deciding is human. Two standard rules flag candidates:
The 1.5×IQR rule (box-plot fences). Compute Q1, Q3, and
IQR = Q3 − Q1. Anything below Q1 − 1.5×IQR or above
Q3 + 1.5×IQR is flagged. Because it's built on quartiles, it's
robust — the outliers themselves don't move the fences — so it
works on skewed data.
The z-score rule. Compute each point's z = (x − mean) / std and
flag anything with |z| > 3 (sometimes 2). It asks "how many standard
deviations from the mean?" But it's built on the mean and standard
deviation, which the outliers themselves inflate — so on skewed or
contaminated data it can hide the very points it's meant to catch.
The two rules often disagree, and seeing how is genuinely clarifying.
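A small sketch of that disagreement on a hand-made, right-skewed sample (the values are hypothetical). It assumes the population standard deviation (NumPy's default, ddof=0) for the z-scores and NumPy's default linear interpolation for the quartiles.

```python
import numpy as np

# Hypothetical right-skewed data: a bulk between 12 and 60, plus four large values.
x = np.array([12, 15, 18, 20, 22, 25, 27, 30, 32, 35,
              38, 40, 45, 50, 55, 60, 300, 400, 500, 600], dtype=float)

# 1.5 x IQR rule: fences come from quartiles, which the extremes barely move.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_flagged = x[(x < lower_fence) | (x > upper_fence)]

# z-score rule: mean and std are both inflated by the same extremes.
z = (x - x.mean()) / x.std()
z_flagged = x[np.abs(z) > 3]

print(f"IQR fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print(f"IQR rule flags:     {iqr_flagged.tolist()}")
print(f"mean {x.mean():.1f}, std {x.std():.1f}, largest |z| {np.abs(z).max():.2f}")
print(f"z-score rule flags: {z_flagged.tolist()}")
```

Here the four large values inflate the mean and standard deviation enough that none of them reaches |z| > 3, while the quartile-based fences still catch all four.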
A box plot draws those IQR fences for you — the "whiskers" extend to the last point inside the fence, and anything beyond is plotted as an individual outlier dot.
From flag to decision
Once a point is flagged, the real work begins: why is it out there?
Practical outlier hygiene
1. Always look at the flagged rows individually: pull them up and read them.
2. Diagnose the cause before acting.
3. If they're real, keep them and switch to robust statistics (median, IQR, MAD) rather than deleting.
4. If you must remove or cap, report results both with and without so the decision is transparent.
5. Document every exclusion. "I dropped 3 rows" with no reason is a red flag in any analysis.
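Reporting both with and without the flagged rows can be sketched as follows. The purchase amounts are hypothetical, and the helper name summarize is an illustrative choice, not part of any library.

```python
import numpy as np

# Hypothetical purchase amounts in dollars; the two large values stand in
# for verified bulk orders (an assumption made for this sketch).
amounts = np.array([35, 42, 58, 60, 75, 80, 95, 110, 150, 180,
                    48_000, 52_000], dtype=float)

def summarize(x):
    """Return the count, mean, median, IQR, and MAD of a 1-D array."""
    q1, q3 = np.percentile(x, [25, 75])
    med = np.median(x)
    return {
        "n": int(x.size),
        "mean": float(x.mean()),
        "median": float(med),
        "iqr": float(q3 - q1),
        "mad": float(np.median(np.abs(x - med))),  # median absolute deviation
    }

# Flag with the 1.5 x IQR fences, then summarize both ways.
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
flagged = (amounts < q1 - 1.5 * iqr) | (amounts > q3 + 1.5 * iqr)

with_all = summarize(amounts)
without_flagged = summarize(amounts[~flagged])

print("flagged rows:", amounts[flagged].tolist())
print("all rows:    ", with_all)
print("flagged out: ", without_flagged)
```

The mean changes by orders of magnitude between the two summaries while the median barely moves, which is the argument for leading with robust statistics when the extremes are real.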
A column of purchase amounts has a few values near $50,000 while most are under $200. Investigation shows these are real bulk orders from genuine wholesale customers. What's the appropriate action?
Delete them so they don't distort the average purchase amount
Keep them, and report robust summaries (median, IQR) alongside the mean so the bulk orders don't dominate the headline number
Replace them with the mean of the other values
Assume they are data-entry errors because they are so large
Putting it together: an outlier flagger
A reusable IQR-based flagger is something you'll write constantly in EDA. The pattern: compute the fences, return which values fall outside, and how many.
Implement the standard 1.5×IQR outlier rule on the provided pandas Series readings.
Compute (using the 25th and 75th percentiles):
- q1, q3: the lower and upper quartiles
- iqr = q3 - q1
- lower_fence = q1 - 1.5 * iqr
- upper_fence = q3 + 1.5 * iqr
Then produce:
- outliers: a Python list of the values in readings that fall below lower_fence or above upper_fence (strictly outside the fences).
- n_outliers: an int, the count of those outliers.
Use the provided readings Series.
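One possible sketch of the steps above. The readings values here are a hypothetical stand-in, since the provided Series is not shown, and Series.quantile uses linear interpolation by default, which is assumed to be acceptable for the exercise.

```python
import pandas as pd

# Hypothetical stand-in for the provided `readings` Series.
readings = pd.Series([10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 25.0, 10.3, 9.7, -4.0])

# Quartiles and fences for the 1.5 x IQR rule.
q1 = readings.quantile(0.25)
q3 = readings.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Strict inequalities: values exactly on a fence are not flagged.
mask = (readings < lower_fence) | (readings > upper_fence)
outliers = readings[mask].tolist()  # plain Python list, in original order
n_outliers = int(mask.sum())        # plain Python int

print(f"fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print(f"outliers: {outliers}")
print(f"n_outliers: {n_outliers}")
```

The output is only a list of candidates; each flagged value still needs the investigate-then-decide step described earlier.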
Check your understanding
A distribution has a long tail stretching toward high values and most of its mass on the left. How is it described, and how do the mean and median compare?
Left-skewed, with the mean below the median
Right-skewed, with the mean above the median
Symmetric, since the peak is on one side
Bimodal, because the mass and the tail are separated
On strongly right-skewed data, the z-score rule (|z| > 3) flags fewer points than the 1.5×IQR rule. Why?
The IQR rule is simply more sensitive by design and always flags more
The extreme high values inflate the mean and standard deviation that the z-score uses, raising the |z| > 3 threshold so those very points fall short of it
The z-score rule only works on integer data
The IQR rule ignores the median, making it flag more points
You find three values of exactly -999 in a temperature column that otherwise ranges from −20 to 45. What is the most likely explanation and correct handling?
They are genuine record-cold readings worth keeping
They are almost certainly a sentinel value for "missing," and should be treated as missing data rather than as real temperatures
They are random outliers to flag with the z-score rule and keep
They prove the sensor is broken and the whole column should be discarded
Which statement about kurtosis is accurate?
High (positive excess) kurtosis means the data is strongly skewed
It measures how many peaks (modes) a distribution has
Positive excess kurtosis indicates heavier tails than a normal distribution — extreme values occur more often than a normal model predicts
A normal distribution has an excess kurtosis of 3
Key takeaways
- Shape decides which statistics are valid; center and spread alone can't reveal it — plot the distribution.
- Skew is named for the long tail: right-skew → mean above median; left-skew → mean below median.
- A second peak (bimodality) usually means two groups mixed together — split them; no single center fits.
- Kurtosis is tail heaviness: heavy tails make "rare" extremes routine.
- Outliers are questions, not verdicts. Flag with the robust 1.5×IQR rule (or z-scores, cautiously); then investigate and decide — fix errors, treat sentinels as missing, and keep genuine extremes while switching to robust statistics. Never auto-delete.