Dataslope

Visualizing Distributions

Why you should see a distribution before you summarize or test it — histograms, box plots, violins, ECDFs, and KDEs, and how a single summary statistic can lie.

You can compute a mean, a standard deviation, and a skewness coefficient without ever looking at your data. You shouldn't. A handful of summary numbers can be identical across distributions that look nothing alike — one a tidy bell, another two humps, a third a flat slab with a spike. Before you summarize, before you run a test, plot the distribution. This page is the visual companion to Shape and Outliers: the same ideas (skew, modality, tails, outliers), seen directly instead of inferred from coefficients.

The plots are simple — histograms, box plots, violins, and ECDFs — but each answers a different question, and knowing which to reach for is a real skill. We'll use Plotly Express and the built-in tips dataset so every chart runs as-is.

Why summaries lie: always plot first

The classic demonstration is Anscombe's quartet — four datasets with nearly identical means, variances, and correlations that look completely different when plotted. The same trap shows up in one dimension: distributions with the same mean and standard deviation can have utterly different shapes. A summary is a lossy compression; the plot is the original.
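Here is a minimal sketch of the one-dimensional version of that trap, using NumPy only. The three samples and the helpers `rescale`, `skewness`, and `central_share` are assumptions made up for this illustration, not data or functions from this page. Each sample is rescaled to the same mean and standard deviation, and two extra numbers then show how different the shapes still are.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 5_000

# Three illustrative shapes (assumed for this sketch): bell, two humps, right-skewed.
bell = rng.normal(size=n)
two_humps = np.concatenate([rng.normal(-2, 0.5, n // 2), rng.normal(2, 0.5, n // 2)])
skewed = rng.exponential(size=n)


def rescale(x, mean=50.0, sd=10.0):
    """Shift and scale x so its sample mean and SD match the targets."""
    z = (x - x.mean()) / x.std()
    return z * sd + mean


def skewness(x):
    """Sample skewness: the mean cubed z-score."""
    z = (x - x.mean()) / x.std()
    return float((z ** 3).mean())


def central_share(x):
    """Fraction of values within half an SD of the mean."""
    return float((np.abs(x - x.mean()) < 0.5 * x.std()).mean())


samples = {
    "bell": rescale(bell),
    "two humps": rescale(two_humps),
    "skewed": rescale(skewed),
}

for name, x in samples.items():
    print(
        f"{name:>10}: mean={x.mean():.2f}  sd={x.std():.2f}  "
        f"skew={skewness(x):+.2f}  central share={central_share(x):.2f}"
    )
```

The mean and SD columns match across all three rows, while the skew and the share of values near the center do not. Those are the differences a plot would show at a glance.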

Misconception: a mean and standard deviation fully describe a distribution

They describe center and spread and nothing else — not the number of peaks, not the skew, not the tails, not the outliers. Two datasets with identical means and SDs can be a smooth bell and a pair of separated humps. Reaching for a summary before a plot is how analysts miss bimodality, outliers, and skew that change the whole conclusion.

Histograms: the workhorse, and the bin trap

A histogram slices the range into bins and counts how many values fall in each. It's the first plot you reach for to see shape: skew, peaks, gaps, outliers all jump out. But a histogram has a hidden knob — the bin count — and the same data can tell different stories at different bin counts. Too few bins hides structure (two peaks blur into one); too many bins turns signal into noise (every bin has one or two points). There is no single "objective" bin count.
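The bin trap can be seen in raw counts, without drawing anything. The sketch below uses made-up data (two overlapping groups, an assumption for this illustration) and bins it twice over the same fixed range with `np.histogram`, which returns the counts per bin and the bin edges.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, illustrative data only
# Assumed example data: two overlapping groups centred at -1.5 and +1.5.
values = np.concatenate([rng.normal(-1.5, 1.0, 2_000), rng.normal(1.5, 1.0, 2_000)])

# Same data, same range, two different bin counts.
# A fixed range keeps the bin edges identical from run to run.
coarse, coarse_edges = np.histogram(values, bins=3, range=(-6, 6))
fine, fine_edges = np.histogram(values, bins=15, range=(-6, 6))

print("3 bins: ", coarse)  # the middle bin dominates, so it reads as one hump
print("15 bins:", fine)    # the middle bin dips below its neighbours on each side
```

With 3 bins the middle bin swallows most of both groups, so the data looks like a single hump. With 15 bins the dip between the groups becomes visible. Neither view is wrong; they just compress the data differently.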

Misconception: there's one correct, objective bin count

Bin count is a choice, and it changes the story. Plotting libraries pick a default with a rule of thumb, but that default is not "the truth." Good practice: view your histogram at a few bin counts (and consider an ECDF, below, which has no bins at all) before concluding anything about shape.

Question (select one)

You plot a histogram of session durations with 6 bins and see a single smooth hump. A colleague plots the same data with 60 bins and sees two clear peaks. Who is right?

The 6-bin version, because fewer bins is less noisy

The 60-bin version, because more bins always shows more truth

Neither bin count is automatically authoritative; you should view several and likely confirm the two peaks with an ECDF or by checking for two subgroups

They cannot both be looking at the same data

Box plots: five-number summary and outliers at a glance

A box plot draws the distribution's skeleton: the box spans Q1 to Q3 (the IQR), a line marks the median, and the whiskers reach to the last points within the 1.5×IQR fences introduced in Measures of Spread and in Shape and Outliers. Points beyond the whiskers show as individual outlier dots. Box plots are compact and brilliant for comparing groups side by side, but they hide modality: a box plot of a two-humped distribution can look just like a box plot of a one-humped one.
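The pieces of a box plot can be computed directly, which makes the drawing rules concrete. This is a sketch on a small made-up array (an assumption for illustration). It uses `np.percentile` with its default linear interpolation; plotting libraries may use a slightly different quartile convention, so the box edges they draw can differ a little.

```python
import numpy as np

# Illustrative values (assumed for this sketch); 30 is deliberately extreme.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Fences sit 1.5 x IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points still inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier dot.
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(f"box: Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(f"whiskers: [{whisker_low}, {whisker_high}], outliers: {outliers}")
```

Note that the upper whisker ends at 9, the last real data point inside the fence, not at the fence value itself.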

Box plots hide modality

Because a box plot only knows quartiles, two distinct humps and one broad hump can produce the identical box. When you need to see whether a group has multiple peaks, use a histogram or a violin plot, not a box plot alone.
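A tiny constructed example makes the point. The two arrays below are assumptions invented for this sketch: one is evenly spread, the other piles up in two clusters, yet both share the same five-number summary and would draw the same box.

```python
import numpy as np

# Two tiny illustrative samples (assumed for this sketch).
flat = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])          # evenly spread
two_clusters = np.array([0, 2, 2, 2, 4, 6, 6, 6, 8])  # piles up at 2 and at 6


def five_number_summary(x):
    """Min, Q1, median, Q3, max: everything a basic box plot draws."""
    return np.percentile(x, [0, 25, 50, 75, 100])


print(five_number_summary(flat))
print(five_number_summary(two_clusters))

# Counts per integer value expose the difference that the summary hides.
print(np.bincount(flat))
print(np.bincount(two_clusters))
```

The summaries are identical, while the per-value counts show one sample spread evenly and the other stacked at 2 and 6.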

Violin plots: shape plus summary

A violin plot is a box plot wearing the distribution's silhouette. It mirrors a smoothed density (a KDE, below) on each side, so you see the actual shape — including bimodality that a box plot would hide — while still comparing groups. The cost is that the smooth outline depends on a smoothing choice, so very small samples can look smoother and more reliable than they are.

KDE: a smooth histogram (intuition only)

A kernel density estimate (KDE) is the smooth curve you often see draped over a histogram or forming a violin's outline. The intuition: instead of dropping each point into a hard bin, you place a little smooth bump on each point and add the bumps up, giving a continuous estimate of the density. It trades the histogram's jagged, bin-dependent steps for a smooth curve — but it has its own knob (the bandwidth, analogous to bin width): too smooth erases peaks, too wiggly invents them. Treat a KDE as a smoothed impression of shape, not ground truth.
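The "bump on each point" idea fits in a few lines of NumPy. This is a sketch, not how any particular plotting library implements its KDE: it assumes a Gaussian bump, and the data, the grid, and the helpers `kde_by_hand` and `count_peaks` are made up for illustration. It evaluates the same data at three bandwidths and counts the peaks in each curve.

```python
import numpy as np

# Illustrative data (assumed for this sketch): two tight clusters of four points.
data = np.array([1.0, 1.2, 1.4, 1.6, 4.4, 4.6, 4.8, 5.0])
grid = np.linspace(-2.0, 8.0, 1001)  # x positions where the density is evaluated


def kde_by_hand(points, x, bandwidth):
    """Average one Gaussian bump per data point, each with SD = bandwidth."""
    z = (x[:, None] - points[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)


def count_peaks(density):
    """Count interior grid points that are higher than both neighbours."""
    mid = density[1:-1]
    return int(np.sum((mid > density[:-2]) & (mid > density[2:])))


for bw in (0.05, 0.5, 3.0):
    density = kde_by_hand(data, grid, bw)
    print(f"bandwidth={bw}: {count_peaks(density)} peak(s)")
```

For this data, the middle bandwidth shows the two clusters, the large one merges them into a single hump, and the tiny one breaks the curve into a spike per point. Same data, three different stories.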

Histogram vs KDE

A histogram shows counts in discrete bins (honest about the raw data, jagged). A KDE shows a smooth density (easy to read, but smoothing can add or hide features). They're two views of the same thing; the bin-count trap and the bandwidth trap are the same trap in different clothes. When in doubt, the ECDF below sidesteps both.

ECDF: the plot with no knobs

The empirical cumulative distribution function (ECDF) answers, for every value x, "what fraction of the data is ≤ x?" You sort the data and walk upward from 0 to 1. Its superpower: it has no bins and no bandwidth, so there is nothing to tune and nothing to accidentally hide. Every data point is shown exactly. It's less intuitive at a glance than a histogram, but it's the honest choice for reading off percentiles ("90% of orders are under $40"), comparing groups, and spotting features without worrying you've smoothed them away.

The y-value at any x is a percentile read directly off the curve — which is exactly the quantity the second challenge below asks you to compute by hand.

Which plot when?

Each plot is tuned to a different question. This is the decision you'll make dozens of times in real EDA.

A reliable first pass

For a single variable, start with a histogram at a couple of bin counts to read shape, then an ECDF to nail down percentiles without binning artifacts. To compare groups, use box plots for a compact overview and violins when modality matters. Pair every plot with the shape statistics from Shape and Outliers — eyes and numbers cross-checking each other.

Challenge 1: histogram bin counts by hand

np.histogram does the numeric work a histogram chart needs: give it data and a bin count and it returns the counts per bin plus the bin edges, no chart required. Computing these yourself is how you'd power a custom plot or test a binning rule.

Challenge
Compute histogram bin counts with NumPy

Use np.histogram to bin the provided array values into 10 equal-width bins.

Produce:

  • counts — a NumPy array of length 10 giving how many values fall in each bin.
  • edges — a NumPy array of length 11 giving the bin edges (np.histogram returns both).
  • total_counted — an int equal to the sum of counts (it should equal len(values), since the outer edges span the full range).
  • fullest_bin — an int, the index (0-based) of the bin with the most values, via np.argmax.

Use np.histogram(values, bins=10) and unpack its two return values.

Challenge 2: ECDF values and a threshold

An ECDF is just sorted data paired with cumulative proportions. For sorted values, the ECDF after the i-th point (1-based) is i / n. Reading the curve at a threshold answers "what fraction of the data is at or below this value?" — a percentile, computed directly.

Challenge
Compute an ECDF and read it at a threshold

Build the empirical CDF of the provided array data from scratch.

Produce:

  • x_sorted — data sorted ascending, as a NumPy array (use np.sort).
  • ecdf — a NumPy array the same length as data where the i-th entry (0-based) is (i + 1) / n (the cumulative proportion up to and including x_sorted[i]). With n = len(data), the last value must be exactly 1.0.
  • prop_at_threshold — a plain Python float: the fraction of data less than or equal to threshold (the ECDF value at threshold). Compute it as (data <= threshold).mean().

Use the provided data array and threshold value.

Check your understanding

Question (select one)

Two distributions have the same mean and the same standard deviation. What can you conclude about their shapes?

They must have the same shape, since mean and SD define a distribution

Almost nothing — they could differ in skew, number of peaks, tail heaviness, and outliers, so you must plot them to compare shapes

They must both be approximately normal

One must be the other shifted left or right

Question (select one)

Which plot is the best choice when you specifically want to check whether a single variable is bimodal (has two peaks)?

A box plot

A histogram (viewed at a sensible bin count) or a violin plot

A single reported median

A bar chart of the mean

Question (select one)

A key advantage of the ECDF over a histogram is that it:

Looks more intuitive to non-technical audiences

Has no bin-count or bandwidth to tune, so it shows every data point without smoothing-driven artifacts and lets you read percentiles directly

Automatically removes outliers from the display

Always reveals bimodality more clearly than a histogram

Question (select one)

You change a histogram's bin count from 10 to 50 and the apparent shape changes noticeably. What's the right takeaway?

The data changed, so you should recollect it

The 50-bin version is correct because higher resolution is always better

Bin count is a display choice that shapes the impression; you should view several bin counts and cross-check with a bin-free view like an ECDF before trusting any single shape

Histograms are useless and should be avoided

Key takeaways

  • Plot before you summarize. Mean and SD compress away skew, modality, tails, and outliers — distributions with identical summaries can look nothing alike.
  • Histograms show shape but depend on a bin count that changes the story; there is no single "objective" number — view several.
  • Box plots are compact and great for comparing groups and flagging outliers, but they hide modality.
  • Violin plots add the full shape (revealing bimodality) on top of a box.
  • KDE smooths a histogram (with a bandwidth knob of its own); the ECDF has no knobs, shows every point, and lets you read percentiles directly.
  • Cross-check every plot with the shape statistics from Shape and Outliers, and lean on these views again when we study The Normal Distribution.
