Exploring Distributions
Histograms, density plots, and boxplots — three ways of *seeing* the entire shape of a column at once. The visual companion to summary statistics.
A summary statistic compresses a column to one number. A distribution plot shows you the whole column at once. Each serves a different purpose, and the experienced analyst always uses both.
Three plots cover most of what you need:
- Histogram — counts in each bucket
- Density plot — a smooth version of the histogram
- Boxplot — a five-number summary as a picture, easy to compare across groups
Histograms
A histogram chops the range of values into bins and counts how many values fall in each. The result is a picture of where values pile up.
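The breaks argument of hist() controls roughly how many bins are
drawn. R treats a single number as a suggestion and rounds to
convenient cut points, so the actual count may differ. The default
usually works, but try breaks = 5 and breaks = 30 to see how much
binning affects the impression.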
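To see the effect of binning for yourself, here is a minimal sketch. It assumes the built-in faithful dataset (waiting times between Old Faithful eruptions) as a stand-in, since any numeric column works; the variable names are illustrative.

```r
# Assumption: faithful$waiting is only a stand-in for "some numeric column".
x <- faithful$waiting

op <- par(mfrow = c(1, 3))   # three panels side by side
h_default <- hist(x, main = "Default breaks", xlab = "Waiting time (min)")
h_coarse  <- hist(x, breaks = 5,  main = "breaks = 5",  xlab = "Waiting time (min)")
h_fine    <- hist(x, breaks = 30, main = "breaks = 30", xlab = "Waiting time (min)")
par(op)                      # restore the previous layout

# breaks is a suggestion: compare the requested numbers with the
# number of bins R actually drew.
length(h_coarse$counts)
length(h_fine$counts)
```

With few bins the two peaks in this column can blur together; with many bins they stand out, but so does random noise.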
Things to look for:
- Center: where is the bulk of the data?
- Spread: narrow or wide?
- Skew: long tail on the right (right-skewed) or on the left (left-skewed)?
- Modes: one peak (unimodal) or more (multimodal — often a sign of a hidden grouping variable)?
- Gaps and outliers: empty zones, lonely far-away values
Density plots
A density plot is a smoothed histogram. It often makes the shape easier to compare, especially across groups.
Density plots have one parameter to keep in mind: the bandwidth (how much smoothing). Too little, and you're back to a noisy histogram. Too much, and you've smoothed real features away. R's default usually does a reasonable job.
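A rough sketch of how bandwidth changes the picture, again assuming faithful$waiting as a stand-in column. The adjust argument of density() multiplies the default bandwidth; the values 0.25 and 4 are arbitrary choices for illustration.

```r
# Assumption: faithful$waiting is only a stand-in for "some numeric column".
x <- faithful$waiting

d_default <- density(x)                 # R's default bandwidth
d_rough   <- density(x, adjust = 0.25)  # a quarter of the default: noisy
d_smooth  <- density(x, adjust = 4)     # four times the default: over-smoothed

# Make the y-axis tall enough for all three curves.
y_max <- max(d_default$y, d_rough$y, d_smooth$y)

plot(d_default, ylim = c(0, y_max), lwd = 2,
     main = "Effect of bandwidth", xlab = "Waiting time (min)")
lines(d_rough,  col = "red")
lines(d_smooth, col = "blue")
legend("topleft",
       legend = c("default", "adjust = 0.25", "adjust = 4"),
       col = c("black", "red", "blue"), lwd = c(2, 1, 1))
```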
Boxplots
A boxplot summarizes a distribution as a picture of the five-number summary (min, Q1, median, Q3, max), with conventions for marking outliers.
- The box spans Q1 to Q3 (the middle 50%).
- The line in the middle is the median.
- The whiskers extend to the most extreme values that are not flagged as outliers.
- Dots beyond the whiskers are conventional outliers (more than 1.5 × IQR below Q1 or above Q3).
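The pieces listed above can be computed by hand. This is a minimal sketch, assuming mtcars$hp as a stand-in column (chosen only because it has a high value that gets flagged). Note that boxplot() takes Q1 and Q3 from fivenum(), whose "hinges" can differ slightly from what quantile() returns.

```r
# Assumption: mtcars$hp is only a stand-in for "some numeric column".
x <- mtcars$hp

five <- fivenum(x)  # min, lower hinge, median, upper hinge, max
names(five) <- c("min", "Q1", "median", "Q3", "max")

iqr         <- five[["Q3"]] - five[["Q1"]]   # width of the box
lower_fence <- five[["Q1"]] - 1.5 * iqr
upper_fence <- five[["Q3"]] + 1.5 * iqr

# Values outside the fences are drawn as dots.
outliers <- x[x < lower_fence | x > upper_fence]
# The whiskers stop at the most extreme values inside the fences.
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

five
outliers
whiskers

# Cross-check against what boxplot() itself uses, then draw it.
bs <- boxplot.stats(x)
boxplot(x, horizontal = TRUE, main = "Horsepower", xlab = "hp")
```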
Boxplots really earn their keep when you want to compare distributions across groups:
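A minimal sketch of such a comparison. It assumes the built-in mtcars dataset, with mpg (miles per gallon) split by cyl (number of cylinders); the page does not name its dataset, so treat that choice as a guess.

```r
# Assumption: mtcars is the dataset being compared; mpg ~ cyl means
# "mpg, split by cyl".
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon",
        main = "Mileage by cylinder count")

# The median line of each box, as numbers.
group_medians <- tapply(mtcars$mpg, mtcars$cyl, median)
group_medians
```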
In one picture: 4-cyl cars get the best mileage, 8-cyl the worst, and 6-cyl sits in between with a tight spread. Three groups, one glance.
ggplot2 versions
The base R plotting we've used so far is quick and convenient. In
the visualization section we'll learn ggplot2, which is more
verbose but enormously more flexible. Here's a sneak peek:
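A rough sketch of what the three plots might look like in ggplot2. It assumes the ggplot2 package is installed and uses mtcars as a stand-in dataset; the bin count and transparency are arbitrary choices.

```r
library(ggplot2)  # assumption: installed via install.packages("ggplot2")

# Histogram of one column.
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)

# Overlaid density curves, one per group.
p_dens <- ggplot(mtcars, aes(x = mpg, fill = factor(cyl))) +
  geom_density(alpha = 0.4)

# Side-by-side boxplots, one per group.
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

print(p_hist)
print(p_dens)
print(p_box)
```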
Reading a distribution: skew
Many real-world quantities are right-skewed (income, transaction sizes, web traffic, response times): most values cluster at the low-to-medium end, with a long tail of high values.
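Here is a small sketch with simulated data. The income values are made up: they are drawn from a log-normal distribution with arbitrary parameters, purely to produce a right-skewed column.

```r
# Assumption: simulated, not real, incomes. The parameters are arbitrary.
set.seed(42)
income <- rlnorm(1000, meanlog = 10.5, sdlog = 0.8)

op <- par(mfrow = c(1, 2))   # two panels side by side
hist(income, breaks = 40, main = "Histogram", xlab = "Income")
boxplot(income, horizontal = TRUE, main = "Boxplot", xlab = "Income")
par(op)                      # restore the previous layout

# Compare the two measures of center.
mean(income)
median(income)
```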
In a right-skewed column, the histogram has a long right tail and a horizontal boxplot shows many outliers on the right side. Both say the same thing: the mean is larger than the median, because a few very large values (the rich, in income data) pull the average up.
A common analytical move for right-skewed data: log-transform it (this requires strictly positive values). The transformed distribution often looks closer to symmetric and becomes easier to work with statistically.
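A minimal sketch of the transformation, using simulated log-normal "income" values (made up for illustration, with arbitrary parameters). Log-normal data becomes exactly normal after log(), so real data will rarely come out this clean.

```r
# Assumption: simulated, not real, incomes. The parameters are arbitrary.
set.seed(42)
income     <- rlnorm(1000, meanlog = 10.5, sdlog = 0.8)
log_income <- log(income)    # natural log; all values must be > 0

op <- par(mfrow = c(1, 2))
hist(income,     breaks = 40, main = "Raw",             xlab = "Income")
hist(log_income, breaks = 40, main = "Log-transformed", xlab = "log(Income)")
par(op)

# Gap between mean and median, in units of standard deviation.
# A smaller gap suggests a more symmetric shape.
gap_raw <- (mean(income) - median(income)) / sd(income)
gap_log <- (mean(log_income) - median(log_income)) / sd(log_income)
gap_raw
gap_log
```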
Test your understanding
A histogram of a column shows a long tail extending to the right of the bulk of the data. This distribution is:
- left-skewed
- right-skewed
- bimodal
- symmetric
In a boxplot, the "box" itself represents:
- The mean ± one standard deviation.
- The middle 50% of the data, from Q1 (25th percentile) to Q3 (75th percentile).
- The full range of the data.
- The confidence interval of the mean.
Why are boxplots especially useful when comparing several groups?
- They have the fewest pixels.
- They compress each group's full distribution into a small, standardized picture that lines up nicely side-by-side.
- They show every individual data point.
- They are the only plot that handles NAs.
Mini challenge: visualize a skewed column
Use a histogram and a boxplot, side by side, to explore the
Ozone column of the built-in airquality dataset (remember
it has NAs). Describe what you see in your own head; the test
just checks the plot was produced.
Produce two side-by-side plots: (1) a histogram of airquality$Ozone (NAs ignored), (2) a boxplot of the same. Set the layout with par(mfrow = c(1, 2)).
So far we've looked at one column at a time. The next page is about relationships between columns, which is where many of the most interesting analyses begin.