Generalization, Overfitting, and Underfitting
The central tension of machine learning — a model must be flexible enough to learn the pattern but disciplined enough not to memorize the noise.
A model's job is not to describe the data it was trained on. Its job is to work on data it has never seen. That ability — to perform well on new, unseen examples — is called generalization, and it is the entire point of machine learning. Everything in this chapter is about the two ways generalization fails, how to recognize them, and what to do about them.
What generalization really means
When you train a model, you show it examples and it adjusts itself to explain them. But the training examples are just a finite sample drawn from some larger reality. The pattern you actually care about lives in that reality, not in your particular sample. A model that generalizes has captured the underlying pattern. A model that fails to generalize has latched onto quirks of the sample that will not repeat.
The two failure modes sit on opposite ends of a single dial — model complexity. Turn it too low and the model is too rigid to capture the pattern (underfitting). Turn it too high and the model is flexible enough to memorize noise (overfitting). The art is finding the middle.
A one-sentence definition
Generalization is judged by the gap between how well a model does on data it trained on and how well it does on data it did not. A small gap, at a good level of performance, is the goal.
Seeing both failures with one example
The cleanest way to build intuition is to watch a model fit a curve. We will generate data from a gently wiggling true function, add a little noise, and then fit polynomials of increasing flexibility. A degree-1 polynomial is a straight line (very rigid). A degree-15 polynomial can bend almost anywhere (very flexible).
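The experiment described above can be sketched as follows. This is a minimal version under stated assumptions, not the page's original code: the true function (one period of a sine wave), the 30 training points, the noise level of 0.2, and the seed are all illustrative choices. Plotting is left out; the sketch only prints the training error of each fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable


def true_function(x):
    # Assumed "gently wiggling" ground truth: one period of a sine wave.
    return np.sin(2 * np.pi * x)


# 30 noisy training points on [0, 1]; sample size and noise level are assumptions.
X_train = rng.uniform(0, 1, size=(30, 1))
y_train = true_function(X_train.ravel()) + rng.normal(scale=0.2, size=30)

train_mse = {}
for degree in (1, 4, 15):
    # PolynomialFeatures expands x into [1, x, x^2, ..., x^degree];
    # LinearRegression then fits a weight for each power.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    print(f"degree {degree:2d}: training MSE = {train_mse[degree]:.4f}")
```

To see the curves themselves, evaluate each fitted model on a dense grid such as `np.linspace(0, 1, 200).reshape(-1, 1)` and plot the predictions against the training points.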
Look at the three panels:
- Degree 1 (underfit). The straight line is too rigid to follow the curve. It is wrong in a systematic way — consistently above the data in some regions, below it in others. This is high bias.
- Degree 4 (good fit). The curve tracks the true green pattern closely without chasing individual dots. This generalizes.
- Degree 15 (overfit). The wild orange curve passes through nearly every training dot — including the noisy ones — and lurches violently between them. It has memorized the sample, not the pattern. This is high variance.
Overfitting looks like success on the training data
The degree-15 model has the lowest error on the training points — it nearly touches every one. If training error were your scorecard, you would pick the worst model. This is exactly why we hold data back.
The complexity–error curve
Let us quantify what the eye just saw. We will sweep the polynomial degree from rigid to flexible and, at each step, measure error on the training data and on a held-out test set.
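A sketch of that sweep, again under assumptions: the data is the same illustrative noisy sine wave (60 points, split in half), and the degree range 1 to 15 is chosen only to span rigid to flexible. Exact numbers will differ with other data or seeds.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Assumed data: noisy samples of one sine period.
X = rng.uniform(0, 1, size=(60, 1))
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

degrees = list(range(1, 16))
train_errors, test_errors = [], []
for degree in degrees:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)  # fit on the training half only
    train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
    test_errors.append(mean_squared_error(y_test, model.predict(X_test)))

# The degree with the lowest held-out error is the candidate sweet spot.
best_degree = degrees[int(np.argmin(test_errors))]

for d, tr, te in zip(degrees, train_errors, test_errors):
    print(f"degree {d:2d}: train MSE = {tr:.4f}, test MSE = {te:.4f}")
print("lowest test error at degree", best_degree)
```

Plotting `train_errors` and `test_errors` against `degrees` (a log scale on the y-axis helps) gives the two curves discussed next.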
This plot is one of the most important pictures in all of machine learning. Notice the two curves behave completely differently:
- Training error (blue) falls steadily and never turns back up. Each added degree can only match or improve the fit to the training data, so training error alone can never warn you about overfitting.
- Test error (orange) is U-shaped. It drops as the model gains enough flexibility to capture the real pattern, bottoms out at the sweet spot, then rises as the model starts fitting noise.
The bottom of the U is the model you want. To the left of it you are underfitting; to the right you are overfitting.
The single most useful diagnostic
Compare training performance to test performance. A small gap with good scores means you are generalizing. A large gap (great on train, poor on test) means overfitting. Both scores poor and close together means underfitting. You cannot diagnose either failure from the training score alone.
The same dial on a different model
Polynomial degree is one complexity dial; every model family has its own. For a decision tree, the dial is max_depth. Watch the identical pattern emerge.
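Here is one way to run that experiment. The dataset is an assumption: `make_moons` with added noise stands in for whatever data the page used, and the depth range 1 to 15 is illustrative.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed dataset: two noisy interleaved half-circles (binary classification).
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

depths = list(range(1, 16))
train_acc, test_acc = [], []
for depth in depths:
    # max_depth caps how many successive splits the tree may make.
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc.append(tree.score(X_train, y_train))  # accuracy on seen data
    test_acc.append(tree.score(X_test, y_test))     # accuracy on held-out data

for d, tr, te in zip(depths, train_acc, test_acc):
    print(f"max_depth {d:2d}: train acc = {tr:.3f}, test acc = {te:.3f}")
```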
Different algorithm, same story. As the tree is allowed to grow deeper, its training accuracy climbs toward a perfect 1.0 while its test accuracy rises, flattens, and then sags. Deep trees memorize; shallow trees generalize but may be too simple. The right depth lives in between.
Underfitting: the quieter failure
Overfitting gets all the attention because it is dramatic, but underfitting is just as real and easier to miss. An underfit model is too simple to capture the pattern, so it is wrong even on the training data. Symptoms:
- Training score is mediocre (not just test score).
- Training and test scores are similar — there is no gap to chase.
- Adding more training data does not help; the model lacks the capacity to use it.
The fix for underfitting is the opposite of the fix for overfitting: give the model more capacity (a higher-degree polynomial, a deeper tree, more/better features) rather than less.
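The third symptom and the fix can be checked with a small sketch. It assumes the same illustrative noisy sine data as before: a straight line is trained on 30, 300, and 3000 points, then a degree-4 polynomial is trained on 300. The sample sizes and the `sample` helper are hypothetical choices for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)


def sample(n):
    # Hypothetical helper: n noisy points from an assumed sine-wave truth.
    X = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * X.ravel()) + rng.normal(scale=0.2, size=n)
    return X, y


X_test, y_test = sample(2000)

# An underfit model: a straight line, trained on more and more data.
line_test_mse = {}
for n in (30, 300, 3000):
    X_train, y_train = sample(n)
    line = LinearRegression().fit(X_train, y_train)
    line_test_mse[n] = mean_squared_error(y_test, line.predict(X_test))
    print(f"line, n = {n:4d}: test MSE = {line_test_mse[n]:.3f}")

# The fix: add capacity instead of data.
X_train, y_train = sample(300)
quartic = make_pipeline(PolynomialFeatures(4), LinearRegression())
quartic.fit(X_train, y_train)
quartic_test_mse = mean_squared_error(y_test, quartic.predict(X_test))
print(f"degree 4, n =  300: test MSE = {quartic_test_mse:.3f}")
```

The line's test error should stay roughly where it started no matter how much data it sees, while the more flexible model drops well below it.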
A quick mental flowchart
Poor on training and test, scores close → underfitting (add capacity or better features). Great on training, much worse on test → overfitting (reduce capacity, regularize, or get more data). Good on both, small gap → you are done.
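The flowchart can be written down as a tiny helper. This is only a sketch: the function name `diagnose` and both thresholds (`good_enough=0.85`, `max_gap=0.05`) are hypothetical. What counts as a "good" score or a "large" gap depends entirely on the problem, so treat the defaults as placeholders.

```python
def diagnose(train_score, test_score, good_enough=0.85, max_gap=0.05):
    """Label a train/test score pair using the flowchart above.

    Scores are "higher is better" (e.g. accuracy). Both thresholds are
    illustrative assumptions, not universal constants.
    """
    gap = train_score - test_score
    if gap > max_gap:
        # Much better on seen data than on unseen data.
        return "overfitting"
    if train_score < good_enough:
        # Scores are close together, but even the training score is weak.
        return "underfitting"
    return "generalizing"


print(diagnose(0.99, 0.78))  # large gap
print(diagnose(0.71, 0.70))  # small gap, mediocre scores
print(diagnose(0.93, 0.91))  # small gap, good scores
```

The gap check comes first, so a model that is both weak and far worse on the test set is reported as overfitting; a real diagnosis would look at both problems.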
How to fight overfitting
When you have diagnosed overfitting, the toolbox is:
- Use a simpler model: fewer features, lower degree, shallower tree, stronger regularization (e.g. a smaller C in LogisticRegression).
- Get more training data. With enough examples, the noise averages out and even a flexible model struggles to memorize it.
- Regularize. Penalize complexity directly (ridge/lasso for linear models, max_depth/min_samples_leaf for trees).
- Hold out and cross-validate honestly so you actually notice the gap instead of celebrating the training score.
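As one illustration of the "regularize" item, the sketch below fits the same degree-15 polynomial twice: once with plain least squares and once with a ridge penalty. The data (a noisy sine wave), the sample sizes, and `alpha=1e-3` are assumptions chosen for the example, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Assumed data: small noisy training set, larger test set, same sine truth.
X_train = rng.uniform(0, 1, size=(30, 1))
y_train = np.sin(2 * np.pi * X_train.ravel()) + rng.normal(scale=0.2, size=30)
X_test = rng.uniform(0, 1, size=(200, 1))
y_test = np.sin(2 * np.pi * X_test.ravel()) + rng.normal(scale=0.2, size=200)

models = {
    "plain": make_pipeline(
        PolynomialFeatures(15, include_bias=False), LinearRegression()
    ),
    # Ridge adds alpha * ||w||^2 to the loss, discouraging huge coefficients.
    "ridge": make_pipeline(
        PolynomialFeatures(15, include_bias=False), Ridge(alpha=1e-3)
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
        # model[-1] is the final estimator in the pipeline.
        "coef_norm": float(np.linalg.norm(model[-1].coef_)),
    }
    print(name, results[name])
```

The penalized model gives up a little training error in exchange for much smaller coefficients. Whether its test error also improves, and by how much, depends on the data and on `alpha`, which is why that value is normally chosen by cross-validation.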
We will not chase every knob here — later chapters cover regularization and cross-validation. The goal of this page is the diagnosis: knowing what overfitting and underfitting look like, and that the test set is how you tell them apart.
Common misconceptions
- "A model that scores 100% is the best model." On training data, a perfect score is usually a symptom of overfitting, not excellence.
- "Overfitting means the model is too big." Not exactly — it means the model is too flexible relative to the amount and cleanliness of the data. The same model can overfit a tiny dataset and generalize on a large one.
- "If test error is higher than training error, something is broken." A small gap is normal and expected — the model has a slight home-field advantage on data it trained on. Only a large gap signals trouble.
- "More features always help." Each extra feature is another dimension in which the model can memorize noise. Irrelevant features make overfitting easier, not harder.
Real-world applications
This is not academic. A credit-scoring model that overfits its historical applicants will approve the wrong people next quarter. A medical model that memorizes the quirks of one hospital's scanner will fail at the hospital across town. A demand forecaster that underfits will miss every seasonal swing. Teams that win in production are the ones that obsessively measure the train–test gap before shipping.
Your turn
You are given a noisy 1-D regression dataset already split into train and test sets (X_train, X_test, y_train, y_test). For each polynomial degree in degrees = range(1, 11):
- Build make_pipeline(PolynomialFeatures(degree), LinearRegression()).
- Fit it on the training set.
- Record the test mean squared error in a list called test_errors (same order as degrees).
Then set best_degree to the degree with the lowest test error.
The tests check that test_errors has 10 entries and that best_degree is the argmin (it should land between 3 and 6, not at 1 and not at 10).
Check your understanding
As you increase a model's complexity, what happens to its error on the training data?
It rises steadily
It is U-shaped, falling then rising
It tends to fall continuously — more flexibility always lets the model fit the training data better
It stays perfectly flat
A model scores 0.99 on training data and 0.78 on test data. What is happening?
Underfitting
Overfitting — the large gap between strong training and weaker test performance is its signature
The model generalizes perfectly
The test set must be broken
A model scores 0.71 on training data and 0.70 on test data. Which description fits best?
Overfitting
Likely underfitting — both scores are mediocre and close together, suggesting the model is too simple to capture the pattern
Perfect generalization, nothing to improve
Data leakage
Which change would you try first to reduce overfitting in a decision tree?
Increase max_depth
Add more irrelevant features
Decrease max_depth (or otherwise constrain the tree), making the model simpler
Train on the test set as well
Why is "the model achieved 100% accuracy on the training data" usually a warning sign rather than good news?
Because 100% accuracy is mathematically impossible
Because a perfect training score often means the model memorized the training data, including its noise, and will likely generalize poorly
Because scikit-learn caps accuracy at 99%
Because it always means the labels are wrong
Underfitting and overfitting are best described as:
Two unrelated bugs
The same thing under different names
Opposite ends of the model-complexity dial — too simple versus too flexible — with good generalization in between
Problems that only occur in deep learning