Box and Violin Plots
Summarizing a distribution with quartiles — the anatomy of a box plot, the shape a violin adds, and the one thing a box plot dangerously hides.
When you have a numeric variable split across several categories, you often don't want every raw point or a single bar — you want a compact summary of each group's distribution so you can compare them at a glance. That's what box plots and violin plots are for.
This page is foundational: box plots are everywhere, and the one thing they hide can genuinely mislead you, so we'll go carefully.
Anatomy of a box plot
A box plot (or box-and-whisker plot) compresses a distribution into a handful of robust summary numbers based on quartiles — the values that cut the sorted data into four equal parts.
Reading the parts:
- The box spans the first quartile (Q1, 25th percentile) to the third quartile (Q3, 75th percentile). Its height is the interquartile range (IQR) — the middle 50% of the data.
- The line inside the box is the median (Q2, 50th percentile).
- The whiskers extend to the most extreme points still within 1.5 × IQR of the box edges.
- Anything past the whiskers is drawn as an individual outlier.
You can read spread from the box height, skew from where the median sits in the box and from unequal whisker lengths, and outliers directly. Here's a real one, annotated:
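To make those parts concrete, here is a minimal sketch that computes them by hand with NumPy. The helper name `box_plot_stats` and the sample values are made up for illustration, and it assumes NumPy's default (linearly interpolated) percentiles; other quartile conventions give slightly different numbers.

```python
import numpy as np

def box_plot_stats(values, whisker=1.5):
    """Return the numbers a box plot draws, using the 1.5 x IQR whisker rule."""
    x = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr    # nothing below this counts as "inside"
    high_fence = q3 + whisker * iqr   # nothing above this counts as "inside"
    inside = x[(x >= low_fence) & (x <= high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside.min(),    # most extreme point still inside
        "whisker_high": inside.max(),
        "outliers": x[(x < low_fence) | (x > high_fence)],
    }

# Hypothetical sample: nine evenly spaced values plus one far-away point.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
stats = box_plot_stats(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Note that the whiskers stop at actual data points, not at the fences themselves, which is why a whisker can be shorter than 1.5 × IQR.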
Comparing groups with box plots
The real power of box plots is comparison: line several up, one per category, and differences in center and spread jump out. Use catplot with kind="box":
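A minimal sketch of such a call, assuming synthetic stand-in data: the column names `group` and `value`, the four group labels, and their centers and spreads are all invented for this example rather than taken from a real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)

# Hypothetical groups: name -> (center, spread).
params = {"A": (10, 1.0), "B": (12, 2.5), "C": (15, 1.5), "D": (11, 4.0)}
n_per_group = 200
df = pd.DataFrame({
    "group": np.repeat(list(params), n_per_group),
    "value": np.concatenate(
        [rng.normal(center, spread, n_per_group) for center, spread in params.values()]
    ),
})

# One box per category on x, summarizing the numeric variable on y.
g = sns.catplot(data=df, x="group", y="value", kind="box")
```

catplot returns a FacetGrid, so the result can be adjusted afterwards (for example with `g.set_axis_labels(...)`).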
Four groups, four five-number summaries, instantly comparable. Box plots stay readable even with many categories, which is exactly why analysts reach for them so often. Data types: a categorical grouping variable and a numeric variable to summarize.
The dangerous thing a box plot hides
A box plot only knows about quartiles. It is completely blind to the shape of the data between those landmarks — most importantly, it can't show whether a group is bimodal (has two peaks). Two wildly different distributions can produce the same box plot.
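A numeric sketch of that blindness, using two made-up samples built to share their quartiles. Both are mirrored around zero (an artificial trick used here only to pin the median at 0), and the helper `share_near_median` is a hypothetical shape check, not a standard statistic.

```python
import numpy as np

rng = np.random.default_rng(42)

def mirrored(half):
    """Reflect a sample around 0 so the result is symmetric with median 0."""
    return np.concatenate([-half, half])

# Group A: spread evenly between -4 and 4, with no gap in the middle.
group_a = mirrored(rng.uniform(0, 4, 1000))
# Group B: two tight clusters near -2 and +2, with few points in between.
group_b = mirrored(rng.normal(2, 0.7, 1000))

def quartiles(x):
    return np.percentile(x, [25, 50, 75])

def share_near_median(x, width=1.0):
    """Fraction of points lying within `width` of the median."""
    return float(np.mean(np.abs(x - np.median(x)) < width))

qa, qb = quartiles(group_a), quartiles(group_b)
near_a, near_b = share_near_median(group_a), share_near_median(group_b)
print("A quartiles:", qa.round(2), "share near median:", round(near_a, 2))
print("B quartiles:", qb.round(2), "share near median:", round(near_b, 2))
```

The quartiles come out close to each other, yet far fewer of group B's points sit near its median: the middle of B is mostly empty, and nothing in the box would tell you.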
The two boxes look nearly identical — yet group B is strongly bimodal and group A is not. The box plot erased that completely. Now swap to a violin:
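A hedged sketch of that swap on synthetic data: group A is drawn evenly across a range and group B from two separated clusters. The groups, column names, and parameters are assumptions chosen so that the quartiles roughly agree.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
n = 1000

flat = rng.uniform(-4, 4, n)  # group A: no gap in the middle
two_peaks = np.concatenate([  # group B: clusters near -2 and +2
    rng.normal(-2, 0.7, n // 2),
    rng.normal(2, 0.7, n // 2),
])
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], n),
    "value": np.concatenate([flat, two_peaks]),
})

# Same call shape as a box plot; only `kind` changes.
g = sns.catplot(data=df, x="group", y="value", kind="violin")
```

Changing `kind="violin"` back to `kind="box"` on the same DataFrame is a quick way to compare the two views.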
The difference is now obvious: group B's violin has two bulges. This is the single most important reason to know about violin plots.
Two groups produce nearly identical box plots. What can you safely conclude?
- The two groups have the same distribution.
- They have similar quartiles, but could still have very different shapes (e.g. one bimodal).
- One group must contain outliers and the other must not.
- The medians are different.
Violin plots: box plus shape
A violin plot combines a box plot's summary with a KDE (the smooth density estimate from the previous chapters), mirrored on both sides. The width at any height shows the estimated density of observations near that value, so you see modality and skew that a box hides. The inner argument controls what's drawn inside:
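A small sketch that draws the same synthetic data once per `inner` setting. It uses the axes-level `sns.violinplot` so the panels can share one figure (catplot forwards `inner` to it); the data and column names are invented, and the option spelling `"quart"` follows recent seaborn documentation, so treat the exact set of accepted values as version-dependent.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "group": np.repeat(["A", "B"], 150),
    "value": np.concatenate([rng.normal(0, 1, 150), rng.normal(1, 2, 150)]),
})

# mini box plot, quartile lines, one marker per point, one line per point, nothing
inner_options = ["box", "quart", "point", "stick", None]
fig, axes = plt.subplots(1, len(inner_options), figsize=(15, 3), sharey=True)
for ax, inner in zip(axes, inner_options):
    sns.violinplot(data=df, x="group", y="value", inner=inner, ax=ax)
    ax.set_title(f"inner={inner!r}")
fig.tight_layout()
```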
Because a violin is a KDE, it inherits the KDE's caveats: it needs enough data per group to be trustworthy, and its smoothing can spill past the data's real range. A violin drawn over five points is decoration, not evidence.
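To put a rough number on that spill, the sketch below hand-rolls a Gaussian KDE and measures how much of its area falls outside the observed range. The bandwidth uses Scott's rule as an assumption for illustration; seaborn's own bandwidth choice and cutoff settings may differ, so read the percentages as indicative only.

```python
import math
import numpy as np

def kde_mass_outside_range(values):
    """Share of a Gaussian KDE's area lying outside [min, max] of the data.

    Bandwidth: Scott's rule (std * n ** -0.2), assumed here for illustration.
    """
    x = np.asarray(values, dtype=float)
    bandwidth = x.std(ddof=1) * len(x) ** (-1 / 5)
    lo, hi = x.min(), x.max()

    def normal_cdf(z):
        return 0.5 * (1 + math.erf(z / math.sqrt(2)))

    # Each observation contributes one Gaussian bump; add up both tails
    # of every bump that extend beyond the observed range.
    outside = [
        normal_cdf((lo - xi) / bandwidth) + 1 - normal_cdf((hi - xi) / bandwidth)
        for xi in x
    ]
    return float(np.mean(outside))

spill_small = kde_mass_outside_range([1, 2, 3, 4, 5])          # five points
spill_large = kde_mass_outside_range(np.linspace(1, 5, 500))   # 500 points, same range
print(f"5 points:   {spill_small:.0%} of the KDE lies outside the data range")
print(f"500 points: {spill_large:.0%} of the KDE lies outside the data range")
```

With more observations the bandwidth shrinks, so less of the curve lands where no data exists.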
Violins need data; box plots tolerate scarcity
With small groups, a box plot's quartiles are still meaningful, but a violin's smooth curve is largely made up. As a rough guide, prefer a box plot (or just show the raw points with a strip or swarm plot) for small groups, and bring in violins when each group has plenty of observations.
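One way to encode that rule of thumb is a small helper that looks at the smallest group. Both the function `suggest_kind` and its cutoff of 30 observations are hypothetical choices for this sketch; the right threshold depends on your data.

```python
from collections import Counter

def suggest_kind(group_labels, min_per_group_for_violin=30):
    """Suggest a catplot `kind` based on the smallest group size.

    The default cutoff of 30 is an arbitrary, illustrative value.
    """
    sizes = Counter(group_labels)
    smallest = min(sizes.values())
    return "violin" if smallest >= min_per_group_for_violin else "box"

small_groups = ["A"] * 5 + ["B"] * 8
large_groups = ["A"] * 200 + ["B"] * 150
print(suggest_kind(small_groups))
print(suggest_kind(large_groups))
```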
Boxen plots: detail in the tails
For large datasets, a boxen (letter-value) plot, drawn with kind="boxen", extends the box idea by drawing progressively smaller boxes for further-out quantiles. It shows the tails in much more detail than a single pair of whiskers, without the smoothing assumptions of a violin:
Choosing among the three
| Plot | Shows | Best when | Weakness |
|---|---|---|---|
| Box (kind="box") | quartiles, whiskers, outliers | comparing many groups; robust summary | blind to shape/modality |
| Violin (kind="violin") | full density + quartiles | each group has plenty of data and shape matters | unreliable / made-up for small groups |
| Boxen (kind="boxen") | many quantiles, detailed tails | large datasets where the tails matter | less familiar to readers |
And keep in mind the option covered on the next page: when groups are small, just show every point with a strip or swarm plot.
Your turn
Using penguins, draw a violin plot with sns.catplot that compares body_mass_g across the three species: x="species", y="body_mass_g", kind="violin". Assign the result to g. The violins will show each species' full mass distribution, not just its quartiles.
Check your understanding
In a standard box plot, what does the box itself (from one edge to the other) represent?
- The full range from the minimum to the maximum value.
- The interquartile range, from Q1 (25th percentile) to Q3 (75th percentile), i.e. the middle 50% of the data.
- One standard deviation on either side of the mean.
- The 95% confidence interval of the mean.
A point is drawn beyond the whisker of a box plot. By the usual rule, what does that mean?
- It is the maximum value, which is always drawn separately.
- It lies more than 1.5 × IQR beyond the nearer quartile, so it's flagged as an outlier.
- It is any value above the median.
- It is a data-entry error.
When is a violin plot a poor choice compared with a box plot or a strip/swarm plot?
- When you have many categories to compare.
- When each group has very few observations, so its smoothed density is largely invented.
- When the variable on the y-axis is numeric.
- When you want to compare medians.
You have a large dataset and care specifically about how the tails behave across categories. Which plot is designed for that?
- A standard box plot.
- A boxen (letter-value) plot.
- A count plot.
- A bar plot of means.
Box plots summarize, violins reveal shape, boxen plots expose tails — but all three still summarize. Next we go the other direction and plot every single observation with strip and swarm plots.