The Bias–Variance Tradeoff

In the last chapter we saw overfitting and underfitting as two ends of a complexity dial. This chapter gives them their proper names — variance and bias — and explains the deep reason you cannot simply eliminate both at once. Understanding this tradeoff is what separates someone who can run models from someone who can reason about them.

Error has parts

When a model makes mistakes on new data, that error comes from three sources that add together:

Bias is error from the model's assumptions being too simple for the truth. A straight-line model trying to fit a curve has high bias: no matter what data you feed it, it is systematically wrong in the same way.
Variance is error from the model being too sensitive to the particular training sample. A super-flexible model will fit a completely different shape if you hand it a slightly different batch of data.
Irreducible noise is the randomness baked into the problem itself — measurement error, genuine unpredictability. No model can remove it, so it sets a floor on how good any model can be.

Written compactly, the expected squared error of a model at a point x, averaged over possible training sets and label noise, decomposes into exactly these three pieces:

\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\text{Bias}\,[\hat{f}(x)]^2}_{\text{too simple}} + \underbrace{\text{Var}\,[\hat{f}(x)]}_{\text{too sensitive}} + \underbrace{\sigma^2}_{\text{irreducible}}

You control the first two. The whole game is trading them off.
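
For reference, the two controllable terms in the decomposition above can be written out explicitly. This is the standard textbook setup rather than anything specific to this course: the symbols f (the true function), \varepsilon (zero-mean label noise with variance \sigma^2), and D (a randomly drawn training set) are notation assumed here for illustration.

```latex
\begin{aligned}
y &= f(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \text{Var}[\varepsilon] = \sigma^2 \\
\text{Bias}\,[\hat{f}(x)] &= \mathbb{E}_D\big[\hat{f}(x)\big] - f(x) \\
\text{Var}\,[\hat{f}(x)] &= \mathbb{E}_D\Big[\big(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)]\big)^2\Big]
\end{aligned}
```

Both expectations are taken over training sets: bias compares the average fitted model to the truth, while variance measures how far individual fitted models stray from that average.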

Bias and variance, in one breath

Bias = how far off the model is on average. Variance = how much the model jumps around when the training data changes. Underfitting is mostly bias; overfitting is mostly variance.

The dartboard picture

The classic way to feel bias and variance is to imagine throwing darts at a bullseye. Each dart is the model's prediction; the bullseye is the truth.

Low bias, low variance (top-left): darts cluster tightly on the bullseye. This is the dream, and it is usually hard to reach with limited data because lowering one tends to raise the other.
High bias, low variance (bottom-left): darts cluster tightly but off-center. Consistent, consistently wrong. This is underfitting.
Low bias, high variance (top-right): darts center on the bullseye on average, but scatter wildly. Any single throw could be far off. This is overfitting.
High bias, high variance (bottom-right): the worst of both.

The deep insight: a tight cluster is not good if it is in the wrong place, and being right on average is not good if any individual prediction could be anywhere.

Watching variance with your own eyes

Variance is about how much a model changes when the training data changes. Let us make that literal: draw many different small training sets from the same true function, fit a model on each, and overlay all the fitted curves. A high-variance model produces a chaotic spaghetti of curves; a low-variance model produces nearly the same curve every time.

The left panel (degree 1) shows gray lines that barely move (low variance), but their average (orange) cannot follow the green truth (high bias). The right panel (degree 15) shows gray lines thrashing all over the place (high variance), even though their average tracks the truth reasonably well (low bias). Same data source, opposite failure modes.
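
The experiment described above can be sketched numerically. This is a minimal illustration under assumptions that are not from the chapter: the true function is taken to be sin(2πx), the noise level is 0.3, and the training inputs are held on a fixed grid so that only the noise changes between training sets. It prints numbers instead of plotting the curves.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(0)


def true_f(x):
    # Assumed ground truth for this sketch (not specified in the chapter).
    return np.sin(2 * np.pi * x)


def fitted_curves(degree, n_train=40, n_sets=300, noise_sd=0.3):
    """Fit one polynomial per simulated training set; return all curves on a grid."""
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed inputs, fresh noise each time
    x_grid = np.linspace(0.1, 0.9, 50)        # evaluate away from the edges
    curves = np.empty((n_sets, x_grid.size))
    for i in range(n_sets):
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, n_train)
        curves[i] = Chebyshev.fit(x_train, y_train, degree)(x_grid)
    return x_grid, curves


def bias2_and_variance(degree):
    x_grid, curves = fitted_curves(degree)
    mean_curve = curves.mean(axis=0)                     # the "orange" average
    bias2 = np.mean((mean_curve - true_f(x_grid)) ** 2)  # gap to the truth
    variance = np.mean(curves.var(axis=0))               # spread of the "gray" curves
    return bias2, variance


bias2_deg1, var_deg1 = bias2_and_variance(1)
bias2_deg15, var_deg15 = bias2_and_variance(15)
print(f"degree  1: bias^2 = {bias2_deg1:.4f}, variance = {var_deg1:.4f}")
print(f"degree 15: bias^2 = {bias2_deg15:.4f}, variance = {var_deg15:.4f}")
```

Under these assumptions the degree-1 model should show the larger bias term and the degree-15 model the larger variance term, mirroring the two panels.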

Why you cannot just have both

Make a model more flexible and you lower its bias (it can match more shapes) but raise its variance (it reacts more to each sample's noise). Make it simpler and you do the reverse. Pushing one down tends to push the other up — that is the tradeoff. The goal is not zero bias or zero variance, but the lowest total error.
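
To see why the target is the lowest total error, one can sweep model flexibility and add the two terms. The sketch below assumes a sin(2πx) truth, noise of 0.3, polynomial degree as the complexity dial, and fixed training inputs; all of these are illustrative choices, not the course's setup.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(1)


def true_f(x):
    # Assumed ground truth for this sketch.
    return np.sin(2 * np.pi * x)


x_train = np.linspace(0.0, 1.0, 40)  # fixed inputs; only the noise is redrawn
x_grid = np.linspace(0.1, 0.9, 50)
degrees = [1, 3, 5, 9, 15]
totals = {}

for degree in degrees:
    # Refit on 300 simulated training sets and stack the predicted curves.
    preds = np.array([
        Chebyshev.fit(
            x_train,
            true_f(x_train) + rng.normal(0.0, 0.3, x_train.size),
            degree,
        )(x_grid)
        for _ in range(300)
    ])
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    totals[degree] = bias2 + variance  # reducible error only; noise floor excluded
    print(f"degree {degree:2d}: bias^2 = {bias2:.4f}, "
          f"variance = {variance:.4f}, sum = {totals[degree]:.4f}")

best_degree = min(totals, key=totals.get)
print("degree with the lowest bias^2 + variance:", best_degree)
```

In this setup the minimum should land at an intermediate degree: the simplest model loses on bias and the most flexible one loses on variance.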

How the tradeoff connects to everything else

This single idea reframes much of the course:

Underfitting = high bias. The model is too simple. Fix by adding capacity or better features.
Overfitting = high variance. The model is too sensitive. Fix by simplifying, regularizing, or adding data.
Regularization is a dial that deliberately adds a little bias to buy a large reduction in variance.
Ensembles like random forests are variance-reduction machines: averaging many high-variance trees smooths out much of their individual jitter while keeping their low bias. That is why a forest usually beats a single deep tree.
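
The regularization dial mentioned in the list above can be illustrated with ridge regression written out in closed form. This is a sketch under assumptions of my own choosing (a sin(2πx) truth, degree-9 Chebyshev features, noise of 0.3, a penalty applied to every coefficient including the intercept), not the course's implementation. Whether a given penalty lowers total error depends on the problem, so the code only reports the two terms.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander

rng = np.random.default_rng(2)


def true_f(x):
    # Assumed ground truth for this sketch.
    return np.sin(2 * np.pi * x)


def features(x, degree=9):
    # Map [0, 1] onto [-1, 1] and expand in a Chebyshev basis.
    return chebvander(2.0 * x - 1.0, degree)


x_train = np.linspace(0.0, 1.0, 30)
x_grid = np.linspace(0.1, 0.9, 50)
phi_train, phi_grid = features(x_train), features(x_grid)


def ridge_predictions(lam, n_sets=300, noise_sd=0.3):
    """Closed-form ridge fits on many simulated training sets."""
    gram = phi_train.T @ phi_train + lam * np.eye(phi_train.shape[1])
    preds = np.empty((n_sets, x_grid.size))
    for i in range(n_sets):
        y = true_f(x_train) + rng.normal(0.0, noise_sd, x_train.size)
        weights = np.linalg.solve(gram, phi_train.T @ y)
        preds[i] = phi_grid @ weights
    return preds


lams = [1e-6, 1e-2, 1.0, 100.0]
results = {}
for lam in lams:
    preds = ridge_predictions(lam)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    results[lam] = (bias2, variance)
    print(f"lam = {lam:g}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

As lam grows, the variance column should generally fall while the bias column rises; the heaviest penalty shrinks the fit so hard that bias dominates.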

More data changes the picture

Here is a hopeful fact: variance shrinks as you add training data. With more examples, the noise averages out, so a flexible model is pulled back toward the true pattern instead of chasing individual points. Bias, by contrast, is a property of the model family — more data will not fix a straight line trying to be a curve.

This gives a practical rule of thumb: if you are overfitting (high variance) and can get more data, do it. If you are underfitting (high bias), more of the same data will not help — you need a richer model or better features.
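
The rule of thumb above can be checked with a small simulation that grows the training set for both a simple and a flexible model. The setup is assumed for illustration only: a sin(2πx) truth, noise of 0.3, evenly spaced training inputs, and training sizes of 25, 100, and 400.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(3)
x_grid = np.linspace(0.1, 0.9, 50)


def true_f(x):
    # Assumed ground truth for this sketch.
    return np.sin(2 * np.pi * x)


def bias2_and_variance(degree, n_train, n_sets=200, noise_sd=0.3):
    """Estimate bias^2 and variance by refitting on many simulated training sets."""
    x_train = np.linspace(0.0, 1.0, n_train)
    preds = np.array([
        Chebyshev.fit(
            x_train,
            true_f(x_train) + rng.normal(0.0, noise_sd, n_train),
            degree,
        )(x_grid)
        for _ in range(n_sets)
    ])
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance


sizes = (25, 100, 400)
results = {}
for n_train in sizes:
    for degree in (1, 15):
        results[(degree, n_train)] = bias2_and_variance(degree, n_train)
        bias2, variance = results[(degree, n_train)]
        print(f"n = {n_train:3d}, degree {degree:2d}: "
              f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

The degree-15 variance should drop sharply as n grows, while the degree-1 bias term stays roughly where it started no matter how much data is added.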

Diagnose before you treat

More data cures variance, not bias. A more complex model cures bias, not variance. Diagnosing which problem you have (via the train–test gap from the previous chapter) tells you which lever to pull. Pulling the wrong one wastes effort.

Common misconceptions

"Low variance is always good." Not if it comes with high bias — a model that confidently gives the same wrong answer every time has low variance and is useless.
"You should minimize bias." You should minimize total error. A bit of bias (via regularization) is often worth it for a big cut in variance.
"Bias here means social/ethical bias." In this context bias is a statistical term about model assumptions. It is unrelated to fairness bias (though models can have both kinds — do not confuse them).
"Ensembles reduce bias." Bagging and forests mainly reduce variance. Lowering bias is the job of boosting or of using a more expressive base model.

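The last misconception above can be made concrete with a toy simulation of averaging. This is not a real random forest: each "tree" is just a simulated number equal to the truth plus a shared offset (bias) plus noise, and the values for the offset, the spread sigma, and the correlation rho between trees are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

truth = 2.0       # hypothetical true value at one query point
offset = 0.3      # bias shared by every simulated tree
sigma = 1.0       # standard deviation of a single tree's prediction
rho = 0.2         # assumed correlation between any two trees
n_trees = 50
n_trials = 20000  # number of independent simulated "forests"

# Each tree's error mixes a component shared by the whole forest with a private one,
# which gives every tree variance sigma^2 and pairwise correlation rho.
shared = rng.normal(size=(n_trials, 1))
private = rng.normal(size=(n_trials, n_trees))
tree_preds = truth + offset + sigma * (
    np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
)

single = tree_preds[:, 0]         # one tree on its own
forest = tree_preds.mean(axis=1)  # average of all trees

var_single, var_forest = single.var(), forest.var()
bias_single, bias_forest = single.mean() - truth, forest.mean() - truth

# Variance of an average of equally correlated predictors.
predicted_var_forest = rho * sigma**2 + (1.0 - rho) * sigma**2 / n_trees

print(f"single tree: bias = {bias_single:.3f}, variance = {var_single:.3f}")
print(f"average    : bias = {bias_forest:.3f}, variance = {var_forest:.3f}")
print(f"formula for the averaged variance: {predicted_var_forest:.3f}")
```

Averaging leaves the shared offset untouched and shrinks only the part of the spread that differs from tree to tree, which is why the correlated component sets a floor on how much variance an ensemble can remove.
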
Real-world applications

A weather model that is consistently 5 degrees too high has bias; one whose forecast swings wildly with tiny changes in inputs has variance. A recommendation system trained on too few users will overfit their quirks (variance); one using only a coarse "popular items" rule will miss individual taste (bias). Every modeling decision — how complex, how regularized, how much data — is implicitly a choice on this tradeoff.

Your turn

You will estimate the variance of two models by retraining each on many random training sets and measuring how much their prediction at a single fixed point bounces around.

A helper make_dataset() returns a fresh noisy (X, y) each call. The fixed evaluation point is x_star (shape (1, 1)).

For each model in turn (a degree-1 pipeline = simple, a degree-15 pipeline = flexible):

Refit it on 60 fresh datasets from make_dataset().
Collect its scalar prediction at x_star each time.
Store the variance of those 60 predictions: var_simple for the degree-1 model and var_flexible for the degree-15 model (use np.var(...)).

The tests check both variances are computed and that var_flexible > var_simple — the flexible model is far less stable.

Check your understanding

Question (select one)

In bias–variance terms, what does bias measure?

How much predictions change when the training data changes

The amount of random noise in the labels

How far the model's predictions are from the truth on average, due to overly simple assumptions

The number of features in the model

Question (select one)

A model whose predictions swing dramatically when you retrain it on a slightly different sample has:

High bias

High variance

Zero irreducible error

Perfect generalization

Question (select one)

Why is it usually impossible to drive both bias and variance to zero by adjusting model complexity?

Because scikit-learn limits model complexity

Because increasing complexity lowers bias but raises variance, and decreasing it does the reverse — they trade off

Because variance is always larger than bias

Because more data removes bias

Question (select one)

You are clearly overfitting (high variance) and can collect more labeled data. What should you expect?

More data will increase variance

More data typically reduces variance, pulling a flexible model back toward the true pattern

More data will fix bias but not variance

Nothing changes with more data

Question (select one)

Why does a random forest usually outperform a single deep decision tree?

It has higher bias than a single tree

It uses fewer features

Averaging many high-variance trees cancels out their individual noise, cutting variance while keeping the low bias of deep trees

It is guaranteed to reach zero error
