Generalization, Overfitting, and Underfitting

A model's job is not to describe the data it was trained on. Its job is to work on data it has never seen. That ability — to perform well on new, unseen examples — is called generalization, and it is the entire point of machine learning. Everything in this chapter is about the two ways generalization fails, how to recognize them, and what to do about them.

What generalization really means

When you train a model, you show it examples and it adjusts itself to explain them. But the training examples are just a finite sample drawn from some larger reality. The pattern you actually care about lives in that reality, not in your particular sample. A model that generalizes has captured the underlying pattern. A model that fails to generalize has latched onto quirks of the sample that will not repeat.

The two failure modes sit on opposite ends of a single dial — model complexity. Turn it too low and the model is too rigid to capture the pattern (underfitting). Turn it too high and the model is flexible enough to memorize noise (overfitting). The art is finding the middle.

A one-sentence definition

Generalization is a model's ability to perform about as well on data it did not train on as on data it did. It is usually measured by the gap between training performance and held-out performance. A small gap, at a good level of performance, is the goal.

Seeing both failures with one example

The cleanest way to build intuition is to watch a model fit a curve. We will generate data from a gently wiggling true function, add a little noise, and then fit polynomials of increasing flexibility. A degree-1 polynomial is a straight line (very rigid). A degree-15 polynomial can bend almost anywhere (very flexible).

Look at the three panels:

Degree 1 (underfit). The straight line is too rigid to follow the curve. It is wrong in a systematic way — consistently above the data in some regions, below it in others. This is high bias.
Degree 4 (good fit). The curve tracks the true green pattern closely without chasing individual dots. This generalizes.
Degree 15 (overfit). The wild orange curve passes through nearly every training dot — including the noisy ones — and lurches violently between them. It has memorized the sample, not the pattern. This is high variance.
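The three fits can be reproduced with a few lines of scikit-learn. This is a minimal sketch, not the page's original code: the true function (a sine wave), the noise level, the sample size, and the random seed are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)


def true_fn(x):
    # Assumed "gently wiggling" ground truth; the original function is not given.
    return np.sin(2 * np.pi * x)


# 30 noisy training points on [0, 1] (sample size and noise level are assumptions).
X_train = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y_train = true_fn(X_train.ravel()) + rng.normal(0, 0.2, 30)

train_mse = {}
for degree in (1, 4, 15):
    # Expand x into [1, x, x^2, ..., x^degree], then fit ordinary least squares.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    print(f"degree {degree:>2}: train MSE = {train_mse[degree]:.4f}")
```

Only training error is measured here, so the numbers should shrink as the degree grows.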

Overfitting looks like success on the training data

The degree-15 model has the lowest error on the training points — it nearly touches every one. If training error were your scorecard, you would pick the worst model. This is exactly why we hold data back.

The complexity–error curve

Let us quantify what the eye just saw. We will sweep the polynomial degree from rigid to flexible and, at each step, measure error on the training data and on a held-out test set.

This plot is one of the most important pictures in all of machine learning. Notice the two curves behave completely differently:

Training error (blue) falls as complexity grows. For nested model families such as polynomials of increasing degree, extra flexibility lets the model fit the training data at least as well as before. Training error alone can never warn you about overfitting, because it keeps improving.
Test error (orange) is U-shaped. It drops as the model gains enough flexibility to capture the real pattern, bottoms out at the sweet spot, then rises as the model starts fitting noise.

The bottom of the U is the model you want. To the left of it you are underfitting; to the right you are overfitting.
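The sweep described above might look like the following sketch. The data generator, noise level, sample sizes, and seed are assumptions made for illustration; the exact position of the minimum will vary with them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)


def make_data(n):
    # Assumed data source: a sine wave plus Gaussian noise.
    X = rng.uniform(0, 1, (n, 1))
    y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.3, n)
    return X, y


X_train, y_train = make_data(30)    # small training set, easy to overfit
X_test, y_test = make_data(200)     # held-out data the model never sees

degrees = list(range(1, 16))
train_errors, test_errors = [], []
for d in degrees:
    model = make_pipeline(PolynomialFeatures(d), LinearRegression())
    model.fit(X_train, y_train)
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

# The bottom of the test-error curve marks the degree to prefer.
best_degree = degrees[int(np.argmin(test_errors))]
for d, tr, te in zip(degrees, train_errors, test_errors):
    print(f"degree {d:>2}: train MSE = {tr:.3f}   test MSE = {te:.3f}")
print("lowest test error at degree", best_degree)
```

Plotting `train_errors` and `test_errors` against `degrees` gives the two curves discussed above.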

The single most useful diagnostic

Compare training performance to test performance. A small gap with good scores means you are generalizing. A large gap (great on train, poor on test) means overfitting. Both scores poor and close together means underfitting. You cannot diagnose either failure from the training score alone.

The same dial on a different model

Polynomial degree is one complexity dial; every model family has its own. For a decision tree, the dial is max_depth. Watch the identical pattern emerge.

Different algorithm, same story. As the tree is allowed to grow deeper, its training accuracy climbs toward a perfect 1.0 while its test accuracy rises, flattens, and then sags. Deep trees memorize; shallow trees generalize but may be too simple. The right depth lives in between.
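A sketch of the same experiment with a decision tree follows. The dataset is synthetic, generated with `make_classification`; its size, the amount of label noise (`flip_y`), and the list of depths are assumptions for illustration, not the page's original setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic dataset: 20 features, 5 informative, 10% of labels flipped.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, flip_y=0.1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
for depth in (1, 2, 4, 8, 16, None):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    train_acc, test_acc = results[depth]
    print(f"max_depth={str(depth):>4}: train = {train_acc:.3f}   test = {test_acc:.3f}")
```

The unconstrained tree should reach a training accuracy of 1.0 on this data, while its test accuracy stays well below that.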

Underfitting: the quieter failure

Overfitting gets all the attention because it is dramatic, but underfitting is just as real and easier to miss. An underfit model is too simple to capture the pattern, so it is wrong even on the training data. Symptoms:

Training score is mediocre (not just test score).
Training and test scores are similar — there is no gap to chase.
Adding more training data does not help; the model lacks the capacity to use it.

The fix for underfitting is the opposite of the fix for overfitting: give the model more capacity (a higher-degree polynomial, a deeper tree, more/better features) rather than less.
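To see why more data does not rescue an underfit model, the sketch below fits a straight line to curved data at three training-set sizes, then fits a cubic to the smallest one. The sine-wave data, noise level, sizes, and seed are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)


def make_data(n):
    # Assumed curved ground truth plus Gaussian noise.
    X = rng.uniform(0, 1, (n, 1))
    y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, n)
    return X, y


X_test, y_test = make_data(1000)

# A straight line cannot bend, no matter how many examples it sees.
linear_mse = {}
for n in (50, 500, 5000):
    X_train, y_train = make_data(n)
    line = LinearRegression().fit(X_train, y_train)
    linear_mse[n] = mean_squared_error(y_test, line.predict(X_test))
    print(f"straight line, n={n:>4}: test MSE = {linear_mse[n]:.3f}")

# More capacity helps even with only 50 examples.
X_small, y_small = make_data(50)
cubic = make_pipeline(PolynomialFeatures(3), LinearRegression()).fit(X_small, y_small)
cubic_mse = mean_squared_error(y_test, cubic.predict(X_test))
print(f"cubic,         n=  50: test MSE = {cubic_mse:.3f}")
```

The line's test error should stay roughly flat across the three sizes, while the cubic's should be much lower.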

A quick mental flowchart

Poor on training and test, scores close → underfitting (add capacity or better features). Great on training, much worse on test → overfitting (reduce capacity, regularize, or get more data). Good on both, small gap → you are done.
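The flowchart can be written as a small helper. The function name `diagnose` and both thresholds (`good` and `max_gap`) are hypothetical: what counts as a "good" score or a "large" gap depends on the problem, so treat the defaults as placeholders.

```python
def diagnose(train_score, test_score, good=0.85, max_gap=0.05):
    """Classify a fit from its train and test scores (higher is better).

    `good` and `max_gap` are illustrative thresholds, not standard values.
    """
    gap = train_score - test_score
    if gap > max_gap:
        # Much better on training data than on held-out data.
        return "overfitting"
    if train_score < good:
        # Scores are close together, but even the training score is poor.
        return "underfitting"
    return "good fit"


print(diagnose(0.99, 0.78))  # overfitting
print(diagnose(0.71, 0.70))  # underfitting
print(diagnose(0.92, 0.90))  # good fit
```

The gap is checked first, so a model that is both mediocre on training data and far worse on test data is reported as overfitting.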

How to fight overfitting

When you have diagnosed overfitting, the toolbox is:

Use a simpler model — fewer features, lower degree, shallower tree, stronger regularization (e.g. a smaller C in LogisticRegression).
Get more training data. With enough examples, the noise averages out and even a flexible model struggles to memorize it.
Regularize. Penalize complexity directly (ridge/lasso for linear models, max_depth/min_samples_leaf for trees).
Hold out and cross-validate honestly so you actually notice the gap instead of celebrating the training score.
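As a rough illustration of two of these tools, the sketch below uses 5-fold cross-validation on the training set alone to compare a moderate polynomial, a very flexible one, and the same flexible one with a ridge penalty. The data, the seed, and `alpha=1e-3` are assumptions; whether the ridge variant helps, and by how much, depends on the penalty strength.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(7)

# Assumed noisy 1-D training data (sine wave plus Gaussian noise).
X_train = rng.uniform(0, 1, (30, 1))
y_train = np.sin(2 * np.pi * X_train.ravel()) + rng.normal(0, 0.3, 30)

candidates = {
    "degree 4": make_pipeline(PolynomialFeatures(4), LinearRegression()),
    "degree 15": make_pipeline(PolynomialFeatures(15), LinearRegression()),
    # Same flexibility, but large coefficients are penalized (alpha is a guess).
    "degree 15 + ridge": make_pipeline(
        PolynomialFeatures(15, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1e-3),
    ),
}

cv_mse = {}
for name, model in candidates.items():
    # Each fold is scored on points the model did not see during fitting.
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[name] = -scores.mean()
    print(f"{name:<18} cross-validated MSE = {cv_mse[name]:.3f}")
```

Because every score comes from held-out folds, the unregularized degree-15 model cannot hide behind its low training error.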

We will not chase every knob here — later chapters cover regularization and cross-validation. The goal of this page is the diagnosis: knowing what overfitting and underfitting look like, and that the test set is how you tell them apart.

Common misconceptions

"A model that scores 100% is the best model." On training data, a perfect score is usually a symptom of overfitting, not excellence.
"Overfitting means the model is too big." Not exactly — it means the model is too flexible relative to the amount and cleanliness of the data. The same model can overfit a tiny dataset and generalize on a large one.
"If test error is higher than training error, something is broken." A small gap is normal and expected — the model has a slight home-field advantage on data it trained on. Only a large gap signals trouble.
"More features always help." Each extra feature is another dimension in which the model can memorize noise. Irrelevant features make overfitting easier, not harder.

Real-world applications

This is not academic. A credit-scoring model that overfits its historical applicants will approve the wrong people next quarter. A medical model that memorizes the quirks of one hospital's scanner will fail at the hospital across town. A demand forecaster that underfits will miss every seasonal swing. Teams that win in production are the ones that obsessively measure the train–test gap before shipping.

Your turn

You are given a noisy 1-D regression dataset already split into train and test sets (X_train, X_test, y_train, y_test).

For each polynomial degree in degrees = range(1, 11):

Build make_pipeline(PolynomialFeatures(degree), LinearRegression()).
Fit it on the training set.
Record the test mean squared error in a list called test_errors (same order as degrees).

Then set best_degree to the degree with the lowest test error.

The tests check that test_errors has 10 entries and that best_degree is the argmin (it should land between 3 and 6, not at 1 and not at 10).

Check your understanding

Question (select one)

As you increase a model's complexity, what happens to its error on the training data?

It rises steadily

It is U-shaped, falling then rising

It tends to fall continuously — more flexibility always lets the model fit the training data better

It stays perfectly flat

Question (select one)

A model scores 0.99 on training data and 0.78 on test data. What is happening?

Underfitting

Overfitting — the large gap between strong training and weaker test performance is its signature

The model generalizes perfectly

The test set must be broken

Question (select one)

A model scores 0.71 on training data and 0.70 on test data. Which description fits best?

Overfitting

Likely underfitting — both scores are mediocre and close together, suggesting the model is too simple to capture the pattern

Perfect generalization, nothing to improve

Data leakage

Question (select one)

Which change would you try first to reduce overfitting in a decision tree?

Increase max_depth

Add more irrelevant features

Decrease max_depth (or otherwise constrain the tree), making the model simpler

Train on the test set as well

Question (select one)

Why is "the model achieved 100% accuracy on the training data" usually a warning sign rather than good news?

Because 100% accuracy is mathematically impossible

Because a perfect training score often means the model memorized the training data, including its noise, and will likely generalize poorly

Because scikit-learn caps accuracy at 99%

Because it always means the labels are wrong

Question (select one)

Underfitting and overfitting are best described as:

Two unrelated bugs

The same thing under different names

Opposite ends of the model-complexity dial — too simple versus too flexible — with good generalization in between

Problems that only occur in deep learning
