Dataslope

Sampling and Distributions

Almost every dataset is a sample drawn from some bigger population. Understanding sampling — and the surprisingly orderly behavior of sample averages — turns raw data into evidence.

When pollsters say "we surveyed 1,000 voters," they're not saying the country has 1,000 voters. They're saying: from the many millions of voters (the population), we observed 1,000 of them (the sample) and we'll use that sample to make guesses about the population.

This page explains:

  1. The population/sample distinction and why it matters.
  2. How sample statistics behave (the sampling distribution).
  3. The Central Limit Theorem — the deep reason why "mean ± error" works even when individual data points are weird.

Population vs. sample

A few critical terms:

  • Population: the full set of entities you care about (every voter, every customer, every star in the galaxy).
  • Sample: the subset you actually observed.
  • Parameter (μ, σ, π, …): a true value describing the population. Usually unknown.
  • Statistic (x̄, s, p̂, …): a value you computed from the sample. Known but variable.

We use statistics to estimate parameters. The estimate is always uncertain because someone else sampling 1,000 different voters would get a slightly different x̄.

Drawing a sample in R

Built-in tools:

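Base R's sample() does the drawing. Here is a minimal sketch, assuming a simulated population of 100,000 adult heights with mean 170 cm and SD 10 cm. The names population and my_sample, the seed, and those numbers are illustrative choices.

```r
# Assumed setup: a simulated population of 100,000 adult heights (cm).
set.seed(42)
population <- rnorm(100000, mean = 170, sd = 10)

# sample() draws without replacement by default.
my_sample <- sample(population, size = 50)

# Parameters (normally unknown) next to the statistics that estimate them.
c(pop_mean = mean(population), sample_mean = mean(my_sample))
c(pop_sd   = sd(population),   sample_sd   = sd(my_sample))
```

Rerunning with a different seed gives a different sample, and therefore slightly different sample statistics.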

The sample statistics are close to but not equal to the population values. That gap is the cost of not measuring everyone.

Bias vs. variability

Two things can go wrong with a sample:

  • Bias — your sampling method systematically misses certain cases. (Surveying only people who answer landlines biases toward older respondents.)
  • Variability — even an unbiased method produces sample statistics that wiggle from one sample to the next.

Big sample sizes shrink variability but not bias. A biased survey of 10 million people is still biased. This is why how you sample matters as much as how much you sample.
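
A small simulation can make the "precisely wrong" effect concrete. This is a sketch under made-up assumptions: ages are uniform between 18 and 90, and a hypothetical landline-style survey can only reach people aged 50 or older.

```r
set.seed(1)

# Assumed population: ages of 1,000,000 adults, uniform between 18 and 90.
ages <- runif(1e6, min = 18, max = 90)

# Unbiased but small: a simple random sample of 100 people.
small_random <- sample(ages, size = 100)

# Biased but huge: only people aged 50+ are reachable (hypothetical frame),
# and we survey 300,000 of them.
reachable  <- ages[ages >= 50]
big_biased <- sample(reachable, size = 300000)

c(truth        = mean(ages),
  small_random = mean(small_random),
  big_biased   = mean(big_biased))
```

The small random sample lands near the true mean age, while the huge biased sample settles very steadily on the wrong number.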

The sampling distribution of the mean

If you took many samples of size 50 from our population and computed mean() each time, you'd get a distribution of means. Let's actually do it:

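One way to run that experiment is sketched below. It assumes the same kind of simulated height population (mean 170 cm, SD 10 cm, an illustrative choice) and uses replicate() to repeat the draw 5,000 times. The number of repetitions and the seed are arbitrary.

```r
set.seed(42)
population <- rnorm(100000, mean = 170, sd = 10)  # assumed population
n <- 50

# replicate() re-evaluates the expression and collects the results in a vector.
many_means <- replicate(5000, mean(sample(population, size = n)))

hist(many_means, breaks = 40,
     main = "Sampling distribution of the mean (n = 50)",
     xlab = "Sample mean (cm)")

mean(many_means)            # close to mean(population)
sd(many_means)              # empirical standard error
sd(population) / sqrt(n)    # theoretical standard error
```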

Three things to notice:

  1. The histogram is bell-shaped, centered on the population mean (~170 cm).
  2. The spread of the sampling distribution (the SD of many_means) closely matches sd(population) / sqrt(n), which is the standard error of the mean.
  3. Although any single sample mean bounces around, the distribution of possible sample means is highly predictable.

This regularity — the sampling distribution of x̄ being bell-shaped and narrower than the population — is what makes quantitative inference possible.
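
To see how quickly the standard error shrinks, it can help to tabulate sd / sqrt(n) for a few sample sizes. The sketch below assumes a population SD of 10 cm; that value and the name se_table are illustrative.

```r
sigma <- 10                       # assumed population SD (cm)
n     <- c(10, 50, 200, 1000)

se_table <- data.frame(n = n, standard_error = sigma / sqrt(n))
se_table
```

Because of the square root, quadrupling the sample size (50 to 200) only halves the standard error.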

The Central Limit Theorem (in plain English)

Here's the magic: even if the underlying population is not bell-shaped, the distribution of sample means becomes bell-shaped as n grows. Let's see this with a very non-normal population — an exponential distribution (lots of small values, a long right tail):

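Here is a sketch of that experiment, assuming an exponential population with rate 1 (so its mean and SD are both 1). The helper sample_means() is an illustrative name, not a built-in, and samples are drawn directly from rexp() rather than from a stored population vector.

```r
set.seed(7)

# Illustrative helper: means of `reps` samples of size n from Exp(rate = 1).
sample_means <- function(n, reps = 5000) {
  replicate(reps, mean(rexp(n, rate = 1)))
}

par(mfrow = c(2, 2))   # 2 x 2 grid of plots
hist(rexp(5000, rate = 1), breaks = 40,
     main = "Population (exponential)", xlab = "x")
for (n in c(2, 10, 50)) {
  hist(sample_means(n), breaks = 40,
       main = paste("Sample means, n =", n), xlab = "mean")
}
par(mfrow = c(1, 1))   # restore the default layout
```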

The population is heavily skewed. With n = 2, the distribution of means still looks skewed. With n = 10, it's much closer to bell-shaped. By n = 50, it's nearly normal.

That is the Central Limit Theorem: the sampling distribution of the mean tends toward a normal distribution as n grows, whatever the shape of the population, provided the observations are independent and the population has finite variance. It's why so many statistical methods quietly assume normality of the sample mean rather than of the data itself.

Why this matters in practice

The CLT is the reason you can:

  • Build confidence intervals for the mean without knowing the exact shape of the population.
  • Run a t-test on right-skewed data when n is large and not feel guilty.
  • Trust the standard error formula sd / sqrt(n) as a useful summary of uncertainty.

It does not help you when:

  • Your sample is biased (no math fixes a bad sampling design).
  • n is small and the population is wildly skewed.
  • You care about something other than the mean (medians, maxima, etc., have their own sampling distributions).
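
The last point can be checked with a quick simulation. The sketch below assumes an exponential population with rate 1 and compares the sampling distribution of the mean with that of the maximum at n = 50. The skewness() helper is a simple moment-based measure written for this example, not a base R function.

```r
set.seed(11)
n    <- 50
reps <- 5000

means  <- replicate(reps, mean(rexp(n)))
maxima <- replicate(reps, max(rexp(n)))

# Simple moment-based skewness (about 0 for a symmetric distribution).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

c(skew_of_means = skewness(means), skew_of_maxima = skewness(maxima))
```

Under these assumptions the sample means are close to symmetric, while the sample maxima stay clearly right-skewed, so a normal approximation that works for the mean should not be carried over to the maximum.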

A concrete example: estimating an average rating

Suppose 5,000 customers rated your product 1–5 stars. You can only afford to ask 100 of them. How accurately can you estimate the average rating?

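Here is a sketch of how that could be simulated. The rating probabilities are made up for illustration, as are the names ratings and asked. Only the population size (5,000) and the sample size (100) come from the scenario.

```r
set.seed(2024)

# Assumed population: 5,000 star ratings with made-up probabilities.
ratings <- sample(1:5, size = 5000, replace = TRUE,
                  prob = c(0.05, 0.10, 0.20, 0.35, 0.30))

n     <- 100
asked <- sample(ratings, size = n)   # the 100 customers we can afford to ask

x_bar <- mean(asked)
se    <- sd(asked) / sqrt(n)         # estimated standard error

c(true_mean = mean(ratings), estimate = x_bar, se = se)
c(low = x_bar - 2 * se, high = x_bar + 2 * se)   # approximate 95% interval
```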

The "approximate 95% interval" x̄ ± 2·SE is a quick rule of thumb (the 2 rounds up the normal quantile 1.96): when n is reasonably large, about 95% of intervals built like this cover the true population mean. (We'll polish that idea on the next page.)
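
The "about 95% of the time" claim can be checked by repeating the survey many times and counting how often the interval covers the truth. This is a sketch using the same made-up rating probabilities as an assumed population; the 2,000 repetitions and the seed are arbitrary.

```r
set.seed(99)

# Assumed population: 5,000 star ratings with made-up probabilities.
ratings <- sample(1:5, size = 5000, replace = TRUE,
                  prob = c(0.05, 0.10, 0.20, 0.35, 0.30))
true_mean <- mean(ratings)
n <- 100

# TRUE when the interval from one fresh sample contains the true mean.
covers <- replicate(2000, {
  s  <- sample(ratings, size = n)
  se <- sd(s) / sqrt(n)
  abs(mean(s) - true_mean) <= 2 * se
})

mean(covers)   # proportion of intervals that cover the true mean
```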

Test your understanding

Question (select one)

The standard error of the mean with sample size n and population SD σ is approximately:

Hint: the standard error should get smaller as the sample size n grows.

  • σ
  • σ / √n
  • σ × n
  • σ² / n

Question (select one)

The Central Limit Theorem says:

  • All data is normally distributed if you collect enough.
  • The distribution of the sample mean becomes approximately normal as n grows, regardless of the population shape.
  • Skewed data can be ignored.
  • p-values are always small.

Question (select one)

Why is "sampling 10 million biased people" still a problem?

  • It's never a problem if n is huge.
  • Bias is a systematic offset that big sample size does not fix: you'll just be very precisely wrong.
  • p-values become unreliable.
  • The CLT stops applying.

Mini challenge: estimate the mean with uncertainty

Given the population vector pop, draw a sample of size 30, compute the sample mean x_bar and the standard error se, and report a rough 95% interval ci as a length-2 vector (low, high) using the rule x̄ ± 2·SE.

Challenge: Sample mean + rough 95% CI

Use set.seed(123), take a sample of size 30 from pop, store its mean in x_bar and standard error in se (use sd / sqrt(n)), and build a length-2 numeric vector ci containing x_bar - 2*se and x_bar + 2*se.

Next, we'll unpack what that interval really means — and explore p-values, confidence intervals, and how to keep your intuition honest when reading statistical claims.
