Histograms
The chart that shows the *shape* of one variable — distribution, skew, and outliers
A histogram answers a single, fundamental question: what does the distribution of this one variable look like? Where do most values cluster? Is the distribution symmetric or skewed? Are there outliers? Is it bell-shaped, uniform, or bimodal?
You will use histograms constantly during exploratory analysis — every time you load a new dataset, you should histogram every numeric column as part of getting acquainted.
How a histogram works
A histogram chops the range of a numeric variable into bins (usually equal-width buckets) and shows the count of values that fall into each bin. The result looks like a bar chart, but the bars represent intervals, not categories.
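The binning step can be sketched in plain Python. This is an illustrative toy, not how Plotly does it internally; the helper name `bin_counts` and the sample values are made up for the example.

```python
def bin_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width bins and count values per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical sample data, chosen only to make the bins easy to follow.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = bin_counts(values, n_bins=4)
print(edges)   # bin boundaries: 1, 3, 5, 7, 9
print(counts)  # one count per interval
```

Each count belongs to an interval between two neighboring edges, which is why histogram bars touch while bar-chart bars usually do not.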
The simplest histogram
You can see immediately that most bills fall below $40.
Bin width matters — a lot
The number of bins you choose changes how the distribution looks. Too few bins hide structure; too many bins create noise.
Edit the code: try nbins=5, then 100. You'll see two failure
modes — too-few hides the shape; too-many turns the chart into
noise.
A useful default heuristic is the square-root rule: number of
bins ≈ √n, where n is the row count. For tips() (244 rows), √244
is about 16, which is a reasonable starting point. Plotly's own
default is sensible for most datasets; only override when you
have a reason.
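The square-root rule is simple enough to compute by hand, but a small helper makes the arithmetic explicit. The function name `sqrt_rule_bins` is hypothetical, introduced only for this sketch.

```python
import math

def sqrt_rule_bins(n_rows: int) -> int:
    """Square-root rule: use roughly sqrt(n) bins, rounded to an integer."""
    return max(1, round(math.sqrt(n_rows)))

n_bins = sqrt_rule_bins(244)  # 244 rows, as in tips()
print(n_bins)  # prints 16
```

The result can be passed straight to `nbins=` when you do want to override the default.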
Comparing groups: stacked, grouped, or overlaid
Add color="..." to split the histogram by a categorical variable:
barmode="overlay" with reduced opacity lets two distributions sit
on top of each other so you can compare their shapes. Other
options:
- barmode="stack": stacks the counts (good for showing composition, bad for comparing distributions).
- barmode="group": places bars side by side per bin (can get noisy fast).
Histograms of proportions with histnorm
If your groups have very different sizes, counts are misleading:
the larger group will tend to have taller bars simply because it has
more rows. Use histnorm to normalize:
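With histnorm="probability density", the bars represent probability density (each group's total bar area sums to 1), so the shapes are directly comparable regardless of group size.
Other histnorm values: "percent" and "probability" (bar heights sum to 100 or to 1 within each group) and "density" (count divided by bin width).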
When NOT to use a histogram
- For categorical data, use a bar chart (px.bar) showing counts per category, not a histogram.
- For comparing summary statistics across many groups, a box plot (next page) is more concise.
- For two numeric variables, use a 2-D density heatmap or scatter plot.
A real-world reading of a histogram
When you look at a histogram, ask:
- Where is the center? (Mode, median, mean.)
- How spread out is it? (Standard deviation, IQR.)
- Is it symmetric or skewed? Many real-world distributions (income, wait times, page views) are right-skewed, with a long tail of large values.
- Are there multiple peaks (bimodality)? This often signals two mixed populations, which is a big analytical clue.
- Are there outliers / extreme values?
Train yourself to ask these five questions on every histogram you ever see.
Check your understanding
What is a histogram designed to show?
The relationship between two variables.
A comparison across distinct categories.
The distribution (shape, spread, center, skew) of a single numeric variable.
A trend over time.
What happens if you choose too few bins for a histogram?
The chart looks noisy.
The chart fails to render.
Structure in the distribution (e.g., bimodality, fine clustering) gets averaged away — you only see a coarse outline.
You're comparing the bill-size distribution between two groups with very different group sizes. Why might raw counts on a histogram be misleading?
Raw counts are illegal.
Raw counts force a log scale.
The larger group will always have taller bars, even if the shape of the distributions is similar — counts confound group size with distribution shape.
Raw counts are always wrong.
Which of the following is a sign of a bimodal distribution on a histogram?
A single tall peak in the middle.
A long tail on the right side.
Two distinct peaks separated by a valley.
All bars at the same height.
Which question would a histogram NOT help answer?
"What's the most common range of values?"
"Are there outliers far from the typical range?"
"What's the correlation between income and education?"