Sampling and Distributions
Almost every dataset is a sample drawn from some bigger population. Understanding sampling — and the surprisingly orderly behavior of sample averages — turns raw data into evidence.
When pollsters say "we surveyed 1,000 voters," they're not saying the country has 1,000 voters. They're saying: from the many millions of voters (the population), we observed 1,000 of them (the sample) and we'll use that sample to make guesses about the population.
This page explains:
- The population/sample distinction and why it matters.
- How sample statistics behave (the sampling distribution).
- The Central Limit Theorem — the deep reason why "mean ± error" works even when individual data points are weird.
Population vs. sample
A few critical terms:
- Population: the full set of entities you care about (every voter, every customer, every star in the galaxy).
- Sample: the subset you actually observed.
- Parameter (μ, σ, π, …): a true value describing the population. Usually unknown.
- Statistic (x̄, s, p̂, …): a value you computed from the sample. Known but variable.
We use statistics to estimate parameters. The estimate is always uncertain because someone else sampling 1,000 different voters would get a slightly different x̄.
Drawing a sample in R
Built-in tools:
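A minimal sketch of this step, assuming a simulated population of adult heights. The mean of 170 cm matches the figure quoted later on this page; the SD of 10 cm, the population size of 100,000, the seed, and the name my_sample are assumptions made purely for illustration.

```r
set.seed(42)  # assumed seed, only so the run is reproducible

# Hypothetical population: 100,000 heights in cm (mean 170, SD 10 are assumptions)
population <- rnorm(100000, mean = 170, sd = 10)

# sample() draws `size` values, without replacement by default
n <- 50
my_sample <- sample(population, size = n)

# Parameters (population) next to statistics (sample)
c(pop_mean = mean(population), pop_sd = sd(population))
c(sample_mean = mean(my_sample), sample_sd = sd(my_sample))
```

Rerunning the last four lines with a different seed gives a different sample and slightly different statistics, while the population values stay fixed.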
The sample statistics are close to but not equal to the population values. That gap is the cost of not measuring everyone.
Bias vs. variability
Two things can go wrong with a sample:
- Bias — your sampling method systematically misses certain cases. (Surveying only people who answer landlines biases toward older respondents.)
- Variability — even an unbiased method produces sample statistics that wiggle from one sample to the next.
Big sample sizes shrink variability but not bias. A biased survey of 10 million people is still biased. This is why how you sample matters as much as how much you sample.
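The difference shows up clearly in a small simulation. This sketch assumes a hypothetical population of ages and a deliberately flawed sampling scheme that can only reach people aged 50 or older; every number in it is illustrative.

```r
set.seed(1)  # assumed seed for reproducibility

# Hypothetical population of one million ages, uniform between 18 and 90
ages <- runif(1e6, min = 18, max = 90)

# Unbiased method: simple random samples of two sizes
unbiased_small <- mean(sample(ages, 100))
unbiased_large <- mean(sample(ages, 100000))

# Biased method: only people aged 50 or older can be reached
reachable <- ages[ages >= 50]
biased_small <- mean(sample(reachable, 100))
biased_large <- mean(sample(reachable, 100000))

round(c(truth = mean(ages),
        unbiased_small = unbiased_small, unbiased_large = unbiased_large,
        biased_small = biased_small, biased_large = biased_large), 2)
```

Growing the unbiased sample pulls its mean toward the truth. Growing the biased sample only pins down the wrong number more precisely.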
The sampling distribution of the mean
If you took many samples of size 50 from our population and computed mean() each time, you'd get a distribution of means. Let's actually do it:
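One way to run that experiment, sketched under the assumption that the population is a simulated set of heights (mean 170 cm, SD 10 cm, both assumed). The 5,000 repetitions and the seed are arbitrary choices.

```r
set.seed(42)  # assumed seed for reproducibility

# Assumed stand-in population of heights in cm
population <- rnorm(100000, mean = 170, sd = 10)
n <- 50

# Repeat "draw a sample of size n, take its mean" 5,000 times
many_means <- replicate(5000, mean(sample(population, size = n)))

hist(many_means, breaks = 40,
     main = "Sampling distribution of the mean (n = 50)",
     xlab = "Sample mean (cm)")

mean(many_means)            # centre of the sampling distribution
sd(many_means)              # empirical spread of the sample means
sd(population) / sqrt(n)    # standard error predicted by the formula
```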
Three things to notice:
- The histogram is bell-shaped, centered on the population mean (~170 cm).
- The spread of the sampling distribution (the SD of many_means) closely matches sd(population) / sqrt(n), which is the standard error of the mean.
- Although one sample bounces around, the distribution of possible sample means is highly predictable.
This regularity — the sampling distribution of x̄ being bell-shaped and narrower than the population — is what makes quantitative inference possible.
The Central Limit Theorem (in plain English)
Here's the magic: even if the underlying population is not bell-shaped, the distribution of sample means becomes bell-shaped as n grows. Let's see this with a very non-normal population — an exponential distribution (lots of small values, a long right tail):
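A sketch of that experiment, assuming an exponential population with rate 1 and sample sizes of 2, 10, and 50. The helper name sample_means, the 5,000 repetitions, and the seed are illustrative choices, not part of any library.

```r
set.seed(42)  # assumed seed for reproducibility

# Assumed population: exponential with rate 1 (mean 1, strongly right-skewed)
population <- rexp(100000, rate = 1)

# Hypothetical helper: `reps` sample means for a given sample size n
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(sample(population, size = n)))
}

means_2  <- sample_means(2)
means_10 <- sample_means(10)
means_50 <- sample_means(50)

# Four panels: the population, then the means for each sample size
op <- par(mfrow = c(2, 2))
hist(population, breaks = 50, main = "Population (exponential)")
hist(means_2,    breaks = 50, main = "Sample means, n = 2")
hist(means_10,   breaks = 50, main = "Sample means, n = 10")
hist(means_50,   breaks = 50, main = "Sample means, n = 50")
par(op)
```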
The population is heavily skewed. With n = 2, the distribution of means still looks skewed. With n = 10, it's much closer to bell-shaped. By n = 50, it's nearly normal.
That is the Central Limit Theorem: the sampling distribution of the mean tends toward a normal distribution as n grows, whatever shape the population has, provided the observations are independent and the population variance is finite. It's why so many statistical methods quietly assume normality of the sample mean rather than of the data itself.
Why this matters in practice
The CLT is the reason you can:
- Build confidence intervals for the mean without knowing the exact shape of the population.
- Run a t-test on right-skewed data when n is large and not feel guilty.
- Trust the standard error formula sd / sqrt(n) as a useful summary of uncertainty.
It does not help you when:
- Your sample is biased (no math fixes a bad sampling design).
- n is small and the population is wildly skewed.
- You care about something other than the mean (medians, maxima, etc., have their own sampling distributions).
A concrete example: estimating an average rating
Suppose 5,000 customers rated your product 1–5 stars. You can only afford to ask 100 of them. How accurately can you estimate the average rating?
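A sketch of how this could be simulated. The ratings themselves are made up: the star probabilities, the seed, and the variable names are assumptions, since the real ratings are exactly what you would not know in practice.

```r
set.seed(2024)  # assumed seed for reproducibility

# Hypothetical ratings for all 5,000 customers (probabilities are invented)
ratings <- sample(1:5, size = 5000, replace = TRUE,
                  prob = c(0.05, 0.10, 0.20, 0.35, 0.30))

# Survey 100 customers chosen at random, without replacement
n <- 100
surveyed <- sample(ratings, size = n)

x_bar <- mean(surveyed)              # sample mean
se <- sd(surveyed) / sqrt(n)         # estimated standard error
interval <- c(low = x_bar - 2 * se, high = x_bar + 2 * se)

mean(ratings)   # true average, normally unknown
x_bar
se
interval
```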
The "approximate 95% interval" x̄ ± 2·SE is a quick rule of thumb: about 95% of the time, an interval built like this covers the true population mean. (We'll polish that idea on the next page.)
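The "about 95% of the time" claim can be checked by repeating the survey many times. This sketch reuses the same invented ratings population as an assumption; the 2,000 repetitions and the seed are arbitrary.

```r
set.seed(7)  # assumed seed for reproducibility

# Hypothetical ratings for all 5,000 customers (probabilities are invented)
ratings <- sample(1:5, size = 5000, replace = TRUE,
                  prob = c(0.05, 0.10, 0.20, 0.35, 0.30))
mu <- mean(ratings)   # true population mean
n <- 100

# For each simulated survey, record whether x_bar +/- 2 * SE contains mu
covers <- replicate(2000, {
  s <- sample(ratings, size = n)
  se <- sd(s) / sqrt(n)
  abs(mean(s) - mu) <= 2 * se
})

mean(covers)   # fraction of intervals that covered the true mean
```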
Test your understanding
The standard error of the mean with sample size n and population SD σ is approximately:
Hint: the standard error should get smaller as the sample size n grows.
- σ
- σ / √n
- σ × n
- σ² / n
The Central Limit Theorem says:
- All data is normally distributed if you collect enough.
- The distribution of the sample mean becomes approximately normal as n grows, regardless of the population shape.
- Skewed data can be ignored.
- p-values are always small.
Why is "sampling 10 million biased people" still a problem?
- It's never a problem if n is huge.
- Bias is a systematic offset that big sample size does not fix; you'll just be very precisely wrong.
- p-values become unreliable.
- The CLT stops applying.
Mini challenge: estimate the mean with uncertainty
Given the population vector pop, draw a sample of size 30, compute the sample mean x_bar and the standard error se, and report a rough 95% interval ci, a numeric vector of length 2 (low, high), using the rule x̄ ± 2·SE.
Use set.seed(123), take a sample of size 30 from pop, store its mean in x_bar and standard error in se (use sd / sqrt(n)), and build a length-2 numeric vector ci containing x_bar - 2*se and x_bar + 2*se.
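One possible solution sketch. The exercise supplies its own pop; the stand-in vector created here (and the seed used to create it) is an assumption so that the snippet runs on its own, and the name s for the sample is arbitrary.

```r
# Stand-in for the exercise's `pop` (assumption: any numeric vector works)
set.seed(1)
pop <- rnorm(10000, mean = 50, sd = 8)

set.seed(123)                 # seed requested by the exercise
n <- 30
s <- sample(pop, size = n)    # sample of size 30, without replacement

x_bar <- mean(s)              # sample mean
se <- sd(s) / sqrt(n)         # standard error of the mean
ci <- c(x_bar - 2 * se, x_bar + 2 * se)   # rough 95% interval (low, high)

ci
```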
Next, we'll unpack what that interval really means — and explore p-values, confidence intervals, and how to keep your intuition honest when reading statistical claims.