Sampling and Bias
How we choose samples and how sampling goes wrong — random, stratified, cluster, systematic, and convenience sampling, the classic biases that ruin inference, and the all-important distinction between bias and variance.
In Populations and Samples we drew samples with rng.normal(...) and
quietly assumed each one represented the population. This page is about
that assumption — the one that licenses everything else in the course.
How you pick the sample decides whether your estimate is honest. Pick
it well and a few thousand observations can speak for millions. Pick it
badly and a million observations can still point you confidently in the
wrong direction.
That last sentence is the whole point, and it surprises people: more data does not rescue a badly chosen sample. A flawed sampling method bakes in an error that no amount of extra rows removes. Learning to see that error — and to choose samples that avoid it — is one of the highest-value skills in applied statistics.
Why the sampling method is everything
Almost every method in this course assumes your sample is a fair draw from the population. Confidence intervals, p-values, the central limit theorem — all of them are statements about what happens when you sample randomly. Randomness is not a nicety here; it is the mathematical contract that connects the sample you have to the population you care about.
When the sampling mechanism is random, the laws of probability take over and let you say how close your estimate is to the truth. When it isn't random — when the sample is chosen by convenience, by who volunteered, or by who survived — those laws no longer apply, and your tidy confidence interval is quietly measuring the wrong thing.
The word that does all the work: representative
A sample is representative when its composition mirrors the population's on the features that matter. Random sampling doesn't guarantee a representative sample on any single draw, but it makes representativeness the expected outcome and, crucially, lets you quantify how far off you might be. That quantification is the entire game.
Probability sampling methods
The good methods share one trait: every unit has a known, nonzero chance of being selected, set by a random mechanism rather than by you, the respondent, or circumstance. These are called probability sampling methods. Here are the four you'll meet most.
- Simple random sampling (SRS). Every unit is equally likely, and every subset of size n is equally likely. This is the gold standard the math assumes. In code it's rng.choice(population, size=n, replace=False).
- Stratified sampling. Split the population into non-overlapping groups (strata), such as age bands or regions, then sample within each stratum, usually in proportion to its size. Because you guarantee every group is represented, stratification removes the risk of a random draw accidentally missing a subgroup, and it usually gives a more precise estimate than SRS.
- Cluster sampling. Split the population into many small groups (clusters) — city blocks, schools, stores — then randomly pick whole clusters and measure everyone in the chosen ones. It's cheaper when units are geographically scattered, at the cost of some precision.
- Systematic sampling. Order the population, pick a random starting point, then take every k-th unit. Simple and often fine, but dangerous if the ordering has a hidden cycle that lines up with k.
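The four methods above can be sketched in a few lines of NumPy. This is a minimal illustration, not a survey-grade implementation: the population, the strata and cluster labels, and every variable name are hypothetical and chosen only to make the mechanics visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 10,000 units, 4 equal-sized strata, 100 clusters of 100.
N = 10_000
values = rng.normal(loc=50, scale=10, size=N)
stratum = np.repeat(np.arange(4), N // 4)
cluster = np.repeat(np.arange(100), N // 100)
n = 400

# 1. Simple random sample: every subset of size n is equally likely.
srs_idx = rng.choice(N, size=n, replace=False)

# 2. Proportional stratified sample: draw within every stratum.
strat_idx = np.concatenate([
    rng.choice(np.flatnonzero(stratum == s), size=n // 4, replace=False)
    for s in range(4)
])

# 3. Cluster sample: pick whole clusters at random, keep every unit in them.
chosen = rng.choice(100, size=4, replace=False)
cluster_idx = np.flatnonzero(np.isin(cluster, chosen))

# 4. Systematic sample: random start, then every k-th unit.
k = N // n
start = rng.integers(k)
sys_idx = np.arange(start, N, k)

for name, idx in [("SRS", srs_idx), ("stratified", strat_idx),
                  ("cluster", cluster_idx), ("systematic", sys_idx)]:
    print(f"{name:>10}: n={len(idx)}, mean={values[idx].mean():.2f}")
```

All four draws select 400 units, and in each one the selection is made by rng rather than by the analyst. That shared property is what makes them probability methods.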
Stratified vs cluster — don't mix them up
Both split the population into groups, but they do opposite things. Stratified: sample within every group (you want all groups represented) → more precise. Cluster: randomly choose some whole groups and skip the rest (you want lower cost) → usually less precise. Stratified divides to include; cluster divides to economize.
The bad one: convenience sampling
Convenience sampling means taking whoever is easiest to reach: the users who happened to be online, the customers who answered the phone, the friends you texted. There's no random mechanism, so the chance of being selected is unknown and unequal — and almost always correlated with the thing you're measuring. That correlation is what turns "easy" into "wrong."
This is the single most common sampling mistake in real data work, precisely because convenience samples are so cheap. Web analytics, opt-in surveys, app-store reviews, and "we asked our power users" studies are all convenience samples wearing a data-science costume.
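Here is a minimal sketch of that effect. It assumes a hypothetical population of satisfaction scores and a convenience mechanism in which a customer's chance of being reached is proportional to their score. Both the population and the weighting rule are illustrative assumptions, not a model of any real survey.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population of 100,000 satisfaction scores on a 1-10 scale.
population = np.clip(rng.normal(loc=6.0, scale=2.0, size=100_000), 1, 10)
true_mean = population.mean()

n = 1_000

# Random sample: every customer has the same chance of selection.
random_sample = rng.choice(population, size=n, replace=False)

# Convenience sample (assumed mechanism): happier customers are easier to
# reach, so selection probability is proportional to the score itself.
weights = population / population.sum()
convenience_sample = rng.choice(population, size=n, replace=False, p=weights)

print(f"true mean:          {true_mean:.2f}")
print(f"random sample:      {random_sample.mean():.2f}")
print(f"convenience sample: {convenience_sample.mean():.2f}")
```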
The random sample misses the truth by a little — that's ordinary sampling noise, and it shrinks if you collect more. The convenience sample misses by a lot in a consistent direction: it overstates satisfaction because happy customers were overrepresented. That consistent, directional miss is bias, and the next demo shows why you can't outrun it.
The punchline: a bigger biased sample does not help
Here is the idea that separates people who understand sampling from
people who don't. Noise shrinks as n grows; bias does not. If your
selection method is skewed, growing the sample just gives you a sharper,
more confident estimate of the wrong number.
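The sketch below makes this concrete under the same kind of assumption as before: a hypothetical population of satisfaction scores, with a convenience mechanism that selects customers in proportion to their score. For speed it samples with replacement, which changes little when the population is ten times larger than the biggest sample.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population of one million satisfaction scores (1-10 scale).
population = np.clip(rng.normal(6.0, 2.0, size=1_000_000), 1, 10)
true_mean = population.mean()

# Assumed convenience mechanism: selection probability proportional to score.
weights = population / population.sum()

print(f"true mean = {true_mean:.3f}")
print(f"{'n':>8} {'random':>8} {'convenience':>12}")
results = {}
for n in [100, 1_000, 10_000, 100_000]:
    random_est = rng.choice(population, size=n).mean()           # equal chances
    conv_est = rng.choice(population, size=n, p=weights).mean()  # skewed chances
    results[n] = (random_est, conv_est)
    print(f"{n:>8} {random_est:>8.3f} {conv_est:>12.3f}")
```

Plotting each column against n, with a dashed horizontal line at true_mean, gives one line per method.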
Watch the two lines. The random estimate homes in on the dashed truth
line as n grows. The convenience estimate also stabilizes — but on the
wrong value. At n = 100,000 the biased estimate is rock-steady and
badly wrong. It looks precise; it is precisely misleading.
Misconception: 'we have tons of data, so it must be accurate'
Sample size controls noise (variance); sampling method controls bias. A huge sample with a biased method is a high-confidence wrong answer, arguably worse than a small honest one, because its size feels authoritative. The famous 1936 Literary Digest poll collected about 2.4 million responses and still called the U.S. presidential election wrong, because its mailing list (car and telephone owners during the Depression) oversampled the wealthy. Size did not save it. Method sank it.
Bias vs. variance: the central distinction
Every estimate misses the truth in two fundamentally different ways, and keeping them separate is one of the most important habits in statistics.
- Bias is a systematic miss — the estimate is centered on the wrong value. Averaging many biased samples does not converge to the truth. Bias comes from how you sample (or how you measure). You cannot fix it with more data.
- Variance is a random miss: the estimate scatters around its center from sample to sample. Variance comes from finite sample size. You fix it with more data (the spread of the estimate shrinks like 1/√n).
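In symbols, the two kinds of miss can be written as follows. The notation is introduced here for illustration: θ is the true population value, θ̂ is the estimate computed from a sample, and σ is the population standard deviation.

```latex
\operatorname{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta,
\qquad
\operatorname{Var}(\hat\theta) = \mathbb{E}\!\left[\big(\hat\theta - \mathbb{E}[\hat\theta]\big)^2\right]

\mathbb{E}\!\left[(\hat\theta - \theta)^2\right]
  = \operatorname{Bias}(\hat\theta)^2 + \operatorname{Var}(\hat\theta)

% For the mean of a simple random sample of size n
% (ignoring the finite-population correction):
\operatorname{SD}(\bar{x}) = \frac{\sigma}{\sqrt{n}}
```

Only the variance term depends on n. The bias term is fixed by the sampling mechanism, so it remains in the total error however large n becomes.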
The classic picture is a dartboard: bias is where the cluster of darts is centered relative to the bullseye; variance is how spread out the cluster is.
The bottom-left is the dream: a tight cluster on the bullseye. More data moves you leftward (less variance) but cannot move you downward (less bias). Only a better sampling method does that. This is why "just collect more data" is the right instinct for noise and the wrong instinct for bias.
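One way to see the two quantities separately is to repeat each sampling method many times and summarize the resulting sample means. The sketch below assumes a hypothetical population of 200,000 satisfaction scores and a convenience mechanism with selection probability proportional to score. The helper simulate is an illustrative name, and the draws are made with replacement for speed.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of satisfaction scores (1-10 scale).
population = np.clip(rng.normal(6.0, 2.0, size=200_000), 1, 10)
true_mean = population.mean()

# Assumed convenience mechanism: selection probability proportional to score.
weights = population / population.sum()

def simulate(n, p=None, reps=300):
    """Return (bias, spread) of the sample mean over `reps` repeated samples."""
    means = np.array([rng.choice(population, size=n, p=p).mean()
                      for _ in range(reps)])
    return means.mean() - true_mean, means.std(ddof=1)

summary = {}
print(f"{'method':<12} {'n':>6} {'bias':>8} {'spread':>8}")
for method, p in [("random", None), ("convenience", weights)]:
    for n in [200, 5_000]:
        bias, spread = simulate(n, p)
        summary[(method, n)] = (bias, spread)
        print(f"{method:<12} {n:>6} {bias:>8.3f} {spread:>8.3f}")
```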
Run it and compare the rows. Going from n=200 to n=5000 shrinks the
spread of both methods dramatically — that's variance falling. But the
convenience method's bias (its center minus the truth) barely budges.
Variance is a sample-size problem; bias is a method problem. They do not
trade off, and one will not fix the other.
An online retailer estimates average customer satisfaction from an opt-in pop-up survey that 0.5% of visitors complete. They worry the estimate is too noisy, so they run the survey for a year and collect 800,000 responses. What's the most likely outcome?
The estimate is now accurate because 800,000 is an enormous sample
The estimate is now very precise but still systematically biased, because opt-in respondents differ from typical customers
The estimate is both unbiased and precise, since a full year covers all customer types
The extra data makes the estimate worse than a small sample would be in every respect
A field guide to sampling bias
Bias has named patterns, and recognizing them by name is half the battle. All of them break the "every unit has a known, fair chance" contract in some specific way.
- Selection bias. The sampling frame itself favors some units. A phone survey at 2pm on a weekday oversamples people who are home at 2pm.
- Self-selection bias. Participants choose themselves in. Online polls, product reviews, and "click here to rate us" all capture people with unusually strong opinions, not typical ones.
- Nonresponse bias. You sampled fairly, but the people who decline differ systematically from those who answer. If busy or dissatisfied customers skip your survey, your responders skew toward the content and available.
- Survivorship bias. You only observe the units that "made it," silently dropping the ones that failed. This one is so important and so sneaky it gets its own section.
- Undercoverage. Part of the population has no chance of being sampled at all — e.g., a web-only survey excludes everyone offline.
Misconception: 'random' means 'haphazard' or 'whatever's handy'
In everyday speech "random" means casual or arbitrary — "I grabbed a
random sample of reviews." In statistics, random sampling is the
opposite of haphazard: it requires a deliberate chance mechanism (a
random number generator, a lottery, rng.choice) where each unit's
probability of selection is known and controlled. Grabbing whatever's
convenient is precisely not random — it's a convenience sample, the
thing that produces bias.
Survivorship bias: the most famous trap
In World War II, the statistician Abraham Wald was asked where to add armor to bombers. The military had data on returning planes and mapped where they were riddled with bullet holes — concentrated on the wings and fuselage. The obvious move: armor the spots with the most holes.
Wald saw the trap. The data came only from planes that returned. The planes hit in the engines and cockpit weren't in the dataset — they didn't come back. The holes showed where a plane could be hit and survive. The armor belonged on the undamaged areas — the places that, when hit, were fatal.
Survivorship bias is everywhere in business data. Mutual-fund performance tables look fantastic partly because funds that did badly get closed and removed from the database — you're seeing only the survivors. "Successful founders dropped out of college" ignores the vast graveyard of dropouts whose startups failed and who were never interviewed. Any time your dataset is "the ones still here," ask what the absent ones would have told you.
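The mutual-fund case is easy to simulate. The sketch below assumes a hypothetical universe in which every fund's return is drawn from the same distribution, so no fund has real skill, and an assumed closure rule that removes funds averaging below 2%. Both the numbers and the rule are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical universe: 5,000 funds, each with a 10-year average annual
# return (in %) drawn from the same distribution.
n_funds = 5_000
returns = rng.normal(loc=5.0, scale=4.0, size=n_funds)

# Assumed closure rule: funds averaging below 2% are shut down and
# dropped from the database.
survived = returns >= 2.0

all_mean = returns.mean()
survivor_mean = returns[survived].mean()
print(f"all funds:      {all_mean:.2f}%  (n={n_funds})")
print(f"survivors only: {survivor_mean:.2f}%  (n={survived.sum()})")
```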
The survivors' average is inflated not because the funds got better, but because the bad ones were deleted from the dataset. No statistical test on the surviving funds can recover the truth — the missing data is missing in a systematic, outcome-dependent way. Survivorship bias is a missing data problem disguised as a sampling problem.
A fitness app reports that users who've been active for 2+ years lost an average of 18 pounds. The marketing team wants to claim "our app helps people lose 18 pounds." What's the core problem?
Nothing — long-term users are the most reliable evidence of effectiveness
The sample is too small to generalize
It's survivorship bias: people for whom the app didn't work mostly quit, so the surviving long-term users overstate the typical effect
The 18-pound figure must be a measurement error
Designing the sample: practice
The first challenge makes the bias-vs-noise point quantitatively: you'll measure how much a convenience method is systematically off. The second implements stratified sampling and checks that it estimates the true mean.
A known population of 200,000 satisfaction scores has been created, with its true mean in true_mean. Two sampling schemes are available:
- a simple random sample, and
- a convenience sample that oversamples high scorers (weights conv_weights are provided).
To separate bias from noise, draw each method 300 times at sample size n = 800, recording the mean of every sample. Then compute a dict named result with:
- "random_bias": (mean of the 300 random-sample means) minus true_mean (a float)
- "conv_bias": (mean of the 300 convenience-sample means) minus true_mean (a float)
Use the provided rng for all draws. Both values must be plain Python floats. The convenience bias should be clearly positive (the method systematically overestimates); the random bias should be near zero.
A company has three regions of very different sizes and very different average spend. A DataFrame pop (columns region and spend) holds the entire population, and true_mean is the true overall average spend.
Implement proportional stratified sampling to estimate the overall mean with a total budget of total_n = 600 observations:
- For each region, sample a number of rows proportional to that region's share of the population (use round and the provided rng.choice on the region's row indices, replace=False).
- The proportional stratified estimate of the overall mean is simply the mean of all the sampled spend values pooled together (because the per-stratum sample sizes already mirror the population shares).
Produce:
- strat_estimate: the pooled mean spend of your stratified sample (a float).
- abs_error: abs(strat_estimate - true_mean) (a float).
The estimate should land close to true_mean (within about 30).
Why stratified sampling shines here
The three regions have very different mean spend (120 vs 200 vs 320). A simple random sample could, by bad luck, grab too few high-spending West customers and skew low. Proportional stratification guarantees each region appears in its correct proportion, eliminating that source of variability. Same honesty as SRS, usually better precision — which is why pollsters and survey teams stratify by design.
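The precision gain can be checked by simulation. The sketch below uses the region means from the text (120, 200, 320) but builds its own hypothetical population. The region names other than West, the region sizes, and the within-region spread of 40 are all assumptions, and plain NumPy arrays stand in for the DataFrame.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical population: sizes, names, and spread are assumptions.
sizes = {"East": 60_000, "Central": 30_000, "West": 10_000}
means = {"East": 120, "Central": 200, "West": 320}
region = np.concatenate([np.full(sz, name) for name, sz in sizes.items()])
spend = np.concatenate([rng.normal(means[name], 40, size=sz)
                        for name, sz in sizes.items()])
true_mean = spend.mean()
total_n = 600

# Row indices of each region, computed once.
rows_by_region = {name: np.flatnonzero(region == name) for name in sizes}

def srs_estimate():
    idx = rng.choice(len(spend), size=total_n, replace=False)
    return spend[idx].mean()

def stratified_estimate():
    picks = []
    for name, rows in rows_by_region.items():
        n_r = round(total_n * len(rows) / len(spend))  # proportional allocation
        picks.append(rng.choice(rows, size=n_r, replace=False))
    return spend[np.concatenate(picks)].mean()        # pooled mean

srs = np.array([srs_estimate() for _ in range(500)])
strat = np.array([stratified_estimate() for _ in range(500)])
print(f"true mean       {true_mean:.1f}")
print(f"SRS        mean {srs.mean():.1f}  spread {srs.std(ddof=1):.2f}")
print(f"stratified mean {strat.mean():.1f}  spread {strat.std(ddof=1):.2f}")
```

Both methods center on the true mean. The stratified estimates scatter less because region-to-region differences no longer contribute to the sampling noise.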
When sampling design matters most
- It matters enormously for surveys, polls, user research, A/B test enrollment, and any time your data is collected by who shows up. These are exactly the situations where self-selection and nonresponse creep in.
- It's subtler but still critical for "found" data — server logs, CRM exports, scraped data. Ask: who or what is missing from this table, and is their absence related to what I'm measuring? (That question catches survivorship and undercoverage.)
- It matters less when you genuinely observe the whole population (a true census) — but then ask whether your "population" is really the one you care about, or just the slice your systems happen to log.
The two-sentence summary of this page
Variance is a sample-size problem you fix with more data. Bias is a sampling-method problem you cannot fix with any amount of data. When an estimate feels off, your first question shouldn't be "do I have enough rows?" — it should be "how was this sample selected, and who's missing?"
Check your understanding
Which scenario describes bias rather than variance?
A poll of 50 random voters gives a wide margin of error
Two random samples from the same population give slightly different means
An opt-in web survey consistently overestimates customer satisfaction no matter how many responses come in
A sample mean happens to fall a bit above the population mean this time
A data scientist says: "Our sample of 5 million log events is so large that it's basically the whole population, so there's no sampling concern." When is this reasoning most dangerous?
When the 5 million events are a true random draw from all events
When all events are identical
When the 5 million events come only from a non-representative subgroup (e.g., only logged-in power users)
When the events were collected over a full calendar year
You want to estimate average household income in a country with distinct regions of differing wealth, and you want to guarantee every region is represented in proportion to its population. Which method fits best?
Convenience sampling at the busiest train station
Stratified sampling with regions as strata, sampling each in proportion to its size
Systematic sampling from an alphabetical list of surnames
Cluster sampling that randomly keeps only 2 of the regions
An investor looks at a database of currently-listed mutual funds, sees an average historical return of 9%, and concludes the typical fund returns 9%. What bias is at play, and can a larger database fix it?
Nonresponse bias, fixable by following up with fund managers
Selection bias, fixable by sampling more current funds
Survivorship bias, and a larger database of surviving funds cannot fix it because the failed funds are systematically absent
Pure variance, fixable by averaging over more years
Why does random sampling (a deliberate chance mechanism) license statistical inference, while haphazard convenience sampling does not?
Because random samples are always larger than convenience samples
Because a known random mechanism gives every unit a known selection probability, which is what the math of inference (margins of error, p-values) is built on
Because convenience samples always contain measurement errors
Because random samples never contain bias of any kind
Key takeaways
- How you sample decides whether inference is honest; the math assumes a random mechanism.
- Probability methods (simple random, stratified, cluster, systematic) give every unit a known selection chance. Convenience sampling does not, and usually introduces bias.
- Bias is a systematic, directional miss from method; it does not shrink with more data. Variance is random scatter from sample size; its spread does shrink (like 1/√n).
- Classic biases: selection, self-selection, nonresponse, survivorship, undercoverage. Always ask who is missing.
- A bigger biased sample is a more confident wrong answer. "Random" means a deliberate chance mechanism, never "haphazard."
With an honest sample in hand, we can finally study how a statistic behaves across many such samples. That's Sampling Distributions — the single most important idea in the course, and the bridge to the Central Limit Theorem, Standard Error, and Confidence Intervals.