Populations and Samples

The population–sample distinction at the heart of inference — parameters you can't observe, statistics you compute to estimate them, and why bigger samples sharpen the estimate rather than change the target.

In Statistical Thinking we said your data is one noisy slice of a larger reality. This page gives that idea its proper vocabulary — population and sample, parameter and statistic — and makes one arrow precise: you compute a statistic from a sample to estimate a parameter of a population you can't fully see. Almost every method in the rest of the course is a tool for drawing that arrow honestly.

The distinction sounds almost too simple to dwell on, yet confusing these four words is behind a huge share of real analytical mistakes: reporting a sample average as if it were the population's, expecting a bigger sample to "change" the answer, or forgetting that a statistic wobbles every time you collect new data. Let's pin them down.

Population vs. sample

The population is the entire collection of units you actually care about — every customer, every transaction this design could ever generate, every wafer the machine could ever produce. The sample is the subset you managed to observe and put in a DataFrame.

The loop is the whole idea: the parameter lives in the population (top right), you can't see it directly, so you draw a sample, compute a statistic, and use that statistic to estimate the parameter. The dotted arrows are inference — uncertain by nature. The solid arrows are things you actually do (sampling and computing), which are concrete and exact.

The population is usually conceptual, not a list

It's natural to imagine the population as a giant spreadsheet you could download if only you had time. Sometimes it is (every employee on the payroll right now). But far more often the population is conceptual and effectively infinite: "all checkout sessions this flow could ever produce," "every future sensor reading from this machine." You're estimating a property of a process, not counting a finite pile. That's why "just collect all of it" usually isn't even possible in principle.

Why we sample instead of taking a census

A census measures every unit in the population. When you can do one cheaply, do it — there's no uncertainty to estimate. But for most data-science questions a census is impossible, impractical, or pointless:

Cost and time. Surveying all 4 million users, or load-testing every possible session, is wildly expensive when 2,000 well-chosen observations answer the question.
The population is infinite or future-facing. You literally cannot measure "all sessions this design will ever serve" — most of them haven't happened yet.
Measurement destroys the unit. A factory testing battery lifetime to failure can't ship the batteries it tested. Destructive testing forces sampling.
A good sample is enough. This is the punchline of the whole course: a modest, representative sample estimates a population parameter with quantifiable precision. You don't need everything — you need enough, collected well.

Sampling is a feature, not a compromise

Sampling isn't a sad approximation you settle for. It's the entire basis of efficient measurement: a 1,500-person poll can speak for a nation because statistics lets you quantify how precise that estimate is. The catch is that the sample must be collected without bias, which is the subject of Sampling and Bias.

Parameters vs. statistics

This is the distinction to burn into memory.

A parameter is a number that describes the population. It's fixed (the process has some true value) but unknown — you can't see it. Parameters get Greek letters.
A statistic is a number you compute from the sample. It's known (you can calculate it) but varies — a different sample gives a different value. Statistics get Latin letters or "hats."

Quantity	Population parameter (unknown, fixed)	Sample statistic (known, varies)
Mean	μ ("mu")	x̄ ("x-bar")
Standard deviation	σ ("sigma")	s
Proportion	p	p̂ ("p-hat")
Size	N	n
Correlation	ρ ("rho")	r

A clean way to keep them straight: Greek = the truth you're after; Latin/hat = your best guess from data. The statistic is the estimator; the parameter is the estimand. You use x̄ to estimate μ, p̂ to estimate p, and so on.

The #1 confusion: reporting a statistic as a parameter

Writing "average revenue per user is $48.20" — full stop — quietly claims you measured the population. You didn't. You measured x̄ = 48.20 from a sample, and the population μ is near it but uncertain. The honest sentence names the estimate and its error: "x̄ = $48.20, and μ is plausibly within $45–$51." Collapsing the statistic and the parameter into one number is how overconfident claims are born.

Seeing it in code: a known population

Here's the trick that makes this concrete. In the real world the population is hidden, so you never get to check your estimate. But in code we can play god: define a population with known parameters, then draw samples and watch the statistics estimate the (now visible) truth.
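
A minimal sketch of that setup, under stated assumptions: the population here is one million simulated order values from a gamma distribution, and the names `population`, `mu`, `sigma`, `x_bar`, and `s`, the seed, and the sample size of 200 are all illustrative choices, not anything fixed by the course.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, only for reproducibility

# Hypothetical population: one million order values from a skewed process.
population = rng.gamma(shape=2.0, scale=25.0, size=1_000_000)

# Parameters: computed from the ENTIRE population (normally impossible).
mu = population.mean()
sigma = population.std(ddof=0)  # population formula, divides by N

# One sample of n = 200 units, drawn at random without replacement.
sample = rng.choice(population, size=200, replace=False)

# Statistics: computed from the sample only.
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample formula, divides by n - 1

print(f"mu    = {mu:.2f}   x_bar = {x_bar:.2f}   error = {x_bar - mu:+.2f}")
print(f"sigma = {sigma:.2f}   s     = {s:.2f}")
```

Changing the seed changes `x_bar` and `s` but leaves `mu` and `sigma` untouched, because only the sample is redrawn.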

The estimate misses by a little — that miss is estimation error, and it's unavoidable from a finite sample. Crucially, the error isn't a mistake you made; it's the price of not seeing the whole population.

ddof=1 for a sample standard deviation

NumPy's .std() defaults to ddof=0 (the population formula, dividing by n). When your data is a sample estimating a population σ, use ddof=1 (dividing by n-1). Pandas' .std() already uses ddof=1 by default. The correction matters most for small samples; we explain why in Measures of Spread.
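
A small sketch of the difference, using five made-up values (the array and variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Five made-up observations, treated as a sample from a larger population.
sample = np.array([12.0, 15.0, 9.0, 14.0, 10.0])

std_population_formula = np.std(sample)       # ddof=0: divides by n
std_sample_formula = np.std(sample, ddof=1)   # ddof=1: divides by n - 1
std_pandas_default = pd.Series(sample).std()  # pandas uses ddof=1 by default

print(f"NumPy, ddof=0 : {std_population_formula:.4f}")
print(f"NumPy, ddof=1 : {std_sample_formula:.4f}")
print(f"pandas default: {std_pandas_default:.4f}")
```

With only five values the two formulas differ noticeably (roughly 2.28 versus 2.55); the gap narrows as n grows because n and n - 1 become nearly equal.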

A statistic is a random variable

Here's the idea that unlocks everything later. Because a statistic depends on which sample you happened to draw, it changes from sample to sample — which means a statistic is itself a random variable with its own distribution. The sample mean isn't a fixed number; it's a number that would have come out differently with a different sample.
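
One way to see this is to draw many samples from the same population and record the mean of each. The sketch below assumes a normal population with hypothetical parameters `MU = 50` and `SIGMA = 12`; the sample size and number of repeats are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed

MU, SIGMA = 50.0, 12.0  # hypothetical true parameters
n = 100                 # size of each sample
n_samples = 5_000       # how many separate samples to draw

# Each row is one sample of size n; each row mean is one realization of x_bar.
samples = rng.normal(loc=MU, scale=SIGMA, size=(n_samples, n))
sample_means = samples.mean(axis=1)

print("first five sample means:", np.round(sample_means[:5], 2))
print(f"mean of the sample means:   {sample_means.mean():.2f} (true mu = {MU})")
print(f"spread of the sample means: {sample_means.std(ddof=1):.2f}")
```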

Two things to notice, because they preview the next several pages. First, the sample means pile up around the true μ — the statistic is "aimed" at the parameter (it's unbiased). Second, they have a spread of their own. That spread — how much an estimate wobbles sample-to-sample — is the standard error, and the whole distribution of a statistic across samples is a sampling distribution. We devote entire pages to them: Sampling Distributions and Standard Error.

Question (select one)

You compute the mean salary of a random sample of 200 employees and get $72,400. A teammate says "so the average salary at the company is $72,400." What's the precise issue?

Nothing — the sample mean is the company average

The sample is too small for any conclusion

$72,400 is the sample statistic $\bar{x}$, an estimate of the unknown population parameter $\mu$; the company's true average is near it but uncertain

The number is wrong because salaries aren't normally distributed

Estimates have error — and it shrinks as n grows

Estimation error is unavoidable, but it's not uncontrollable. The most important lever is sample size: as n grows, your statistic homes in on the parameter. Watch the sample mean converge.
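
A minimal sketch of that convergence, assuming a normal population with hypothetical parameters `MU = 50` and `SIGMA = 12`. A fresh sample is drawn at each size, so the exact errors depend on the seed.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

MU, SIGMA = 50.0, 12.0  # hypothetical true parameters

abs_errors = {}
for n in [10, 100, 1_000, 10_000, 100_000]:
    sample = rng.normal(loc=MU, scale=SIGMA, size=n)  # fresh sample of size n
    x_bar = sample.mean()
    abs_errors[n] = abs(x_bar - MU)
    print(f"n = {n:>7,}   x_bar = {x_bar:7.3f}   |error| = {abs_errors[n]:.3f}")
```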

The error trends toward zero as n grows (it's jagged because each sample is still random, but the envelope shrinks). A key, often-missed detail: precision improves with √n, not n. To halve your error you need roughly four times the data. That diminishing return is exactly why a national poll uses ~1,500 people, not 1,500,000 — past a point, more data buys very little extra precision. We'll quantify this with the standard error formula σ / √n in Standard Error.
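
The square-root rule can be checked empirically. This sketch measures the spread of the sample mean at two sample sizes, one four times the other, and compares them; the population parameters, sizes, and the helper `spread_of_sample_means` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

MU, SIGMA = 50.0, 12.0  # hypothetical true parameters
n_repeats = 4_000       # samples drawn per sample size


def spread_of_sample_means(n):
    """Standard deviation of x_bar across n_repeats samples of size n."""
    samples = rng.normal(loc=MU, scale=SIGMA, size=(n_repeats, n))
    return samples.mean(axis=1).std(ddof=1)


spread_small = spread_of_sample_means(100)
spread_large = spread_of_sample_means(400)  # four times the data
ratio = spread_small / spread_large

print(f"spread at n=100: {spread_small:.3f}  (formula: {SIGMA / np.sqrt(100):.3f})")
print(f"spread at n=400: {spread_large:.3f}  (formula: {SIGMA / np.sqrt(400):.3f})")
print(f"ratio: {ratio:.2f}")
```

The ratio should come out close to 2 rather than 4, which is the diminishing return described above.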

Misconception: a bigger sample changes the population

A bigger sample does not change the population or its parameters — μ, σ, and p are properties of the population and don't care how much you sampled. What grows with n is the precision of your estimate: x̄ clusters more tightly around the unchanged μ. The target stays put; your aim gets steadier.

Question (select one)

You increase your survey from 500 to 5,000 respondents. Which statement is correct?

The population proportion $p$ will move closer to your estimate

The population proportion $p$ is unchanged; your estimate $\hat{p}$ becomes more precise (less sample-to-sample variability)

Both $p$ and $\hat{p}$ stay exactly the same

The estimate's precision improves 10x because the sample grew 10x

Practice the population–sample loop

A known population of daily order counts has been created for you with true parameters MU_TRUE and SIGMA_TRUE (you may use these only to check your work, not to compute your estimates).

A single sample of size 120 has been drawn. From the sample only, produce a dict named estimates with:

"x_bar" — the sample mean (a float)
"s" — the sample standard deviation using ddof=1 (a float)
"mean_error" — the signed estimation error x_bar - MU_TRUE (a float)

Use sample.mean() and sample.std(ddof=1). All three values must be plain Python floats.

A population of users either churned (1) or stayed (0). The true churn proportion is stored in P_TRUE (use it only to compute the error, not the estimate).

A sample of 400 users (an array of 0s and 1s) has been drawn. Produce:

p_hat — the sample proportion of churners, i.e. the mean of the 0/1 sample (a float).
abs_error — the absolute error abs(p_hat - P_TRUE) (a float).

For a 0/1 array, the proportion of 1s is just its mean. Make both values plain Python floats.

What both challenges quietly demonstrate

You computed statistics (x̄, s, p̂) from a sample and compared them to parameters (μ, σ, p) you'd normally never see. In real work the parameter stays hidden — so instead of measuring the error directly, you estimate it with a standard error or a confidence interval. That's the leap we make in Standard Error and Confidence Intervals.

Check your understanding

Question (select one)

Which of these is a parameter (as opposed to a statistic)?

The mean of the 1,200 rows in your loaded DataFrame

The true average lifetime of every battery this factory will ever produce

The proportion of survey respondents who answered "yes"

The standard deviation of last week's recorded sensor readings

Question (select one)

Why is it usually impossible, even in principle, to compute a population parameter directly for a data-science question like "does this checkout flow convert well"?

Because parameters are always irrational numbers

Because Pandas can't hold that many rows

Because the population is conceptual and partly in the future — it includes all sessions the flow could generate, most of which haven't happened

Because measurement always changes the parameter's value

Question (select one)

A statistic like the sample mean is best described as:

A fixed constant that's the same for any sample from the population

A random variable: its value depends on which sample you drew, so it has a distribution across possible samples

An exact copy of the population parameter it estimates

A value that only exists once you've measured the entire population

Question (select one)

Your estimate of a population mean isn't precise enough. You quadruple the sample size from 250 to 1,000. Roughly how much does the typical estimation error shrink, and why?

It shrinks to zero, because 1,000 is large enough to equal the population

It shrinks by a factor of 4, in proportion to the sample size

It shrinks by about a factor of 2, because precision scales with $\sqrt{n}$ and $\sqrt{4} = 2$

It doesn't change, because the population mean is fixed

Question (select one)

Pick the statement that correctly uses the notation.

$\bar{x}$ is the population mean and $\mu$ is the sample mean

$\hat{p}$ is the sample proportion and $p$ is the population proportion it estimates

$\sigma$ is computed from your sample and $s$ is the unknown population value

$N$ is your sample size and $n$ is the population size

Key takeaways

The population is everything you care about (often conceptual/infinite); the sample is what you observed.
A parameter (μ, σ, p) describes the population — fixed but unknown. A statistic (x̄, s, p̂) is computed from the sample — known but varies.
You estimate parameters with statistics; the gap between them is unavoidable estimation error.
A statistic is a random variable: it changes sample to sample and has its own distribution.
Bigger n sharpens the estimate (precision ∝ √n); it does not change the population or its parameters.
We sample because a census is usually impossible, costly, or unnecessary — a well-collected sample is enough.

Next, Sampling and Bias tackles the part this page assumed away — getting a sample that actually represents the population. Then Standard Error quantifies how much an estimate wobbles, and Confidence Intervals turns that wobble into an honest range for the parameter you can't see.
