Hyperparameter Tuning
The difference between what a model learns and what you choose for it — and how to choose well without quietly cheating on the test set.
Every model you have built so far had two kinds of knobs, even if you only noticed one of them. Some knobs the model turns itself, automatically, while it learns from data. Others you turn, by hand, before training even begins. The first kind are parameters. The second kind are hyperparameters, and choosing them well is the subject of this page.
You have already met several hyperparameters without naming them as a
group: the n_neighbors=5 in KNeighborsClassifier, the max_depth you
used to tame an overfitting decision tree, the C you can hand to
LogisticRegression. None of those values were learned from the data. You
picked them. This page is about picking them deliberately, with evidence,
instead of guessing — and about the one mistake that can make a careful
tuning process worse than useless.
Parameters vs. hyperparameters
The distinction sounds like pedantry, but it is the whole foundation of this topic, so let us make it concrete.
A parameter is a number the model discovers by fitting to data. You
never set it; .fit() computes it. When a LinearRegression finds the
slope and intercept of a line, those coefficients are parameters. When a
DecisionTreeClassifier decides "split on feature 7 at the value 2.6,"
that threshold is a parameter. A trained random forest can hold tens of
thousands of parameters, all derived from the training rows. You could not
write them down by hand if you tried.
A hyperparameter is a number you fix before fitting, which shapes how the learning happens. It is not read from the data; it is part of the recipe. The model cannot learn it from the same data it is fitting, because the hyperparameter is the thing that decides what "fitting" even means.
A one-question test
Ask yourself: who set this value? If the answer is ".fit() did, from
the data," it is a parameter. If the answer is "I did, before training,"
it is a hyperparameter. The coefficients of a regression are parameters;
the C that controls how hard that regression is regularized is a
hyperparameter.
Here are the hyperparameters you have already touched, lined up against the parameters they govern:
| Model | A hyperparameter (you set it) | A parameter (the model learns it) |
|---|---|---|
| KNeighborsClassifier | n_neighbors | (none; KNN just stores the data) |
| DecisionTreeClassifier | max_depth, min_samples_leaf | each split's feature and threshold |
| LinearRegression | (few; it is barely tunable) | slope(s) and intercept |
| LogisticRegression | C (inverse regularization strength: smaller C means stronger regularization) | the coefficients |
| RandomForestClassifier | n_estimators, max_depth | every split in every tree |
Notice that KNN has essentially no learned parameters — it just memorizes
the training set — and yet it has a crucial hyperparameter, n_neighbors.
That alone should convince you the two ideas are independent. A model can
be almost all hyperparameter (KNN) or almost all parameter
(LinearRegression).
Let us watch a single hyperparameter change a model's behavior, without any
parameter ever being "wrong." We will sweep n_neighbors on the wine
dataset and watch the cross-validated accuracy move.
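The sweep can be sketched roughly as follows. This is a minimal illustration, not necessarily the exact setup used on this page: the list of k values, the 5-fold setting, and the decision to standardize the features inside a small pipeline are all assumptions, and the exact scores you see will depend on them.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Candidate values of k: an illustrative choice, not a recommendation.
k_values = [1, 3, 5, 11, 21, 50]

mean_scores = {}
for k in k_values:
    # Assumption: features are standardized, because KNN relies on distances.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # 5-fold cross-validated accuracy for this value of n_neighbors.
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()
    print(f"k={k:>2}  mean CV accuracy = {mean_scores[k]:.3f}")
```

Only `n_neighbors` differs between iterations of the loop; the data, the algorithm, and the folds stay the same.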
The data never changed. The algorithm never changed. Only the
hyperparameter changed, and the model's quality rose and then fell. With
k=1 the model overfits — it trusts the single nearest point and its
noise. With k=50 it underfits — it averages over so many neighbors that
it blurs the real boundaries. Somewhere in between is a sweet spot. Tuning
is the disciplined search for that spot.
Hyperparameters control the bias–variance tradeoff
Almost every hyperparameter you tune is, underneath, a dial on the
bias–variance tradeoff from earlier in the course. Small k, deep
trees, and large C push toward low bias and high variance (overfitting).
Large k, shallow trees, and small C push toward high bias and low
variance (underfitting). Tuning is the search for the balance point.
Why tuning exists: models do not know their own best settings
It would be lovely if .fit() could also figure out the best max_depth
or the best C on its own. It cannot, and the reason is fundamental, not a
missing feature.
A hyperparameter like max_depth decides how flexible the model is
allowed to be. If you let the model choose its own flexibility using the
training data, it will always choose maximum flexibility, because more
flexibility always fits the training data better. A tree with no depth
limit can memorize every training row and score a perfect 1.000. The
training data, by itself, can never warn you that this is a bad idea —
overfitting looks like success from the inside.
So you need a second opinion: a way to ask "how does this setting do on data the model did not train on?" That is exactly what validation data and cross-validation give you. Hyperparameter tuning exists because the training score is blind to overfitting, and you need an unfooled judge to choose settings.
The training score cannot choose hyperparameters
If you pick hyperparameters by maximizing the training score, you will always pick the most complex model available, because complexity always fits training data better. The training score has no opinion about generalization. You must judge hyperparameters on held-out data.
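To see the blind spot for yourself, a sketch like the one below compares the training score with a cross-validated score as a decision tree is allowed to grow deeper. The depth values, the 75/25 split, and the random seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# None means "no depth limit": the most flexible tree available.
depths = [1, 2, 3, 5, None]

train_scores, cv_scores = {}, {}
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # Held-out view: mean accuracy over 5 folds of the training data.
    cv_scores[depth] = cross_val_score(tree, X_train, y_train, cv=5).mean()
    # Inside view: accuracy on the very rows the tree was fit to.
    train_scores[depth] = tree.fit(X_train, y_train).score(X_train, y_train)
    print(f"max_depth={str(depth):>4}  train={train_scores[depth]:.3f}  "
          f"cv={cv_scores[depth]:.3f}")
```

If you chose `max_depth` from the `train` column alone, the unlimited tree would win, since it memorizes the training rows. The `cv` column is the one that can disagree.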
Three pots of data, three different jobs
Here is the cleanest way to keep the roles straight. Honest tuning needs data split into three jobs — though, as we will see, cross-validation lets you collapse the first two.
- The training set fits the parameters: .fit() reads it to find slopes, splits, and coefficients.
- The validation set chooses the hyperparameters: you try several settings, score each one here, and keep the winner.
- The test set is the final exam. It is unlocked exactly once, after every decision is made, to estimate real-world performance.
The validation set exists because tuning is itself a form of learning.
When you try ten values of C and keep the best, you have used the
validation data to make a decision. If that decision were made on the test
set, the test set would no longer be unseen — you would have leaked it,
slowly, one comparison at a time.
Never tune on the test set
The test set has exactly one job: to give an honest estimate of performance after every choice is locked in. The moment you use the test score to pick a hyperparameter, change a model, or decide to stop, the test set is contaminated and its number becomes optimistic. Choose hyperparameters on validation data or with cross-validation, and touch the test set once, at the end.
Cross-validation does the validation, without wasting data
Carving out a separate validation set has a cost: it is data the model cannot train on, and a single validation split can be unlucky and give a noisy verdict. You already know the cure from the cross-validation chapter. K-fold cross-validation rotates the validation role through the data, so every row gets a turn at being held out, and you average the scores for a far steadier signal.
This is why tuning in scikit-learn is built on cross-validation: it lets you choose hyperparameters honestly without permanently sacrificing a validation set. You still keep a final test set aside — cross-validation replaces the validation split, not the test split.
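As a rough illustration of why the averaged signal is steadier, the sketch below scores the same model on several different single validation splits and then with 5-fold cross-validation. The model, the seeds, and the split sizes are assumptions made for the example.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
# The test set is carved off first and never used below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Option 1: one fixed validation split. Repeat with different seeds
# to see how much the verdict depends on which rows were held out.
single_split_scores = []
for seed in range(5):
    X_fit, X_val, y_fit, y_val = train_test_split(
        X_train, y_train, test_size=0.2, stratify=y_train, random_state=seed
    )
    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_fit, y_fit)
    single_split_scores.append(model.score(X_val, y_val))

# Option 2: 5-fold cross-validation on the same training data.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_train, y_train, cv=5
)

print("single splits:", [round(s, 3) for s in single_split_scores])
print("5-fold mean  :", round(cv_scores.mean(), 3))
```

Each single split is one noisy opinion; the cross-validated mean pools five of them while still letting every training row be used for fitting in some fold.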
GridSearchCV: try every combination, automatically
Sweeping one hyperparameter by hand, as we did with n_neighbors above, is
fine for one knob. But models often have several knobs, and the best value
of one can depend on another. GridSearchCV automates the search: you hand
it a model and a small grid of values to try, and it runs
cross-validation for every combination, then reports the winner.
The "grid" is just the set of all combinations. If you give it 3 values of one hyperparameter and 2 of another, that is 3 × 2 = 6 combinations, and with 5-fold cross-validation that is 6 × 5 = 30 model fits. Keep grids small — the work multiplies fast.
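The arithmetic can be checked with scikit-learn's `ParameterGrid` helper, which enumerates the combinations a grid describes. The particular hyperparameter names and values below are assumptions picked to match the 3 × 2 example.

```python
from sklearn.model_selection import ParameterGrid

# 3 values of one hyperparameter, 2 values of another.
param_grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5]}

n_combinations = len(ParameterGrid(param_grid))
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # 6 30

# Each combination is a plain dict of settings.
for combo in ParameterGrid(param_grid):
    print(combo)
```

With the default `refit=True`, `GridSearchCV` performs one additional fit at the end to retrain the winner on all of the training data.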
Here is the smallest honest example: a tiny grid over a single hyperparameter, wrapped around a small model, judged by cross-validation.
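A minimal sketch of such a search, assuming the wine dataset, a decision tree, and a grid over `max_depth` (the specific depth values, split size, and seeds are illustrative assumptions):

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Seal the test set away before any tuning happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A tiny grid over a single hyperparameter.
param_grid = {"max_depth": [1, 2, 3, 4, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation for every candidate depth
)

# The search only ever sees the training data.
search.fit(X_train, y_train)

print("best params  :", search.best_params_)
print("best CV score:", round(search.best_score_, 3))
```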
Three things to notice. First, GridSearchCV is itself an estimator: you
call .fit() on it just like a model. Second, it only ever saw the
training data — the test set is still sealed. Third, it cross-validated
each candidate depth and kept the one with the highest average score.
Reading the results: best_params_, best_score_, best_estimator_
After fitting, a GridSearchCV exposes three attributes you will use
constantly:
- best_params_: the winning combination, as a dictionary.
- best_score_: the winner's mean cross-validated score (this is a validation-style number, not a test-set number).
- best_estimator_: the model with the best settings, already refit on the entire training set for you. You can call .predict() and .score() on it directly.
best_score_ is a validation score, not a test score
best_score_ is the cross-validated score of the winning setting on the
training data. Because you chose that setting to maximize this very
number, it is slightly optimistic — the search peeked at all those folds to
pick a winner. The honest performance number comes from the untouched test
set, which we evaluate next.
The final, once-only test evaluation
Now — and only now, after the search has chosen everything — we unlock the
test set. We evaluate best_estimator_ on it exactly once.
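In code, that final step might look like the sketch below. It repeats a small search so that it runs on its own; the dataset, grid, and seeds are the same kind of illustrative assumptions as before.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# All tuning decisions are made on the training data only.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 2, 3, 4, 5]},
    cv=5,
)
search.fit(X_train, y_train)

# The number we optimized during the search.
cv_score = search.best_score_

# The number we report: one evaluation on the untouched test set.
test_score = search.best_estimator_.score(X_test, y_test)

print(f"CV score (optimized): {cv_score:.3f}")
print(f"Test score (reported): {test_score:.3f}")
```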
The two numbers are usually close, but they are not the same kind of number. The CV score is what you optimized; the test score is what you report. Keeping them separate is what keeps you honest.
GridSearchCV refits the winner for you
By default GridSearchCV sets refit=True, so after the search it
automatically retrains the best setting on the full training set and
stores it as best_estimator_. You do not need to refit by hand — just use
search.best_estimator_ (or call search.predict(...), which delegates to
it).
Tuning a Pipeline: name the step, then the parameter
Real models are usually a Pipeline — preprocessing plus an estimator —
because that is how you avoid data leakage (the pipelines chapter covers
why in depth). You can tune hyperparameters inside a pipeline. The only
new thing is the naming convention: write the step name, two
underscores, then the parameter name, like model__C.
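A sketch of the convention, assuming a two-step pipeline whose steps are named `"scaler"` and `"model"` (the C values, `max_iter`, and the wine dataset are illustrative assumptions):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step names are ours to choose; the grid keys must match them.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter name>": here, C of the "model" step.
param_grid = {"model__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("best params  :", search.best_params_)
print("best CV score:", round(search.best_score_, 3))
```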
Because the scaler lives inside the pipeline, cross-validation refits it on each fold's training portion only — no test-fold statistics leak into the scaling. This is the leak-free way to tune a model that needs preprocessing, and it is why we almost always tune pipelines rather than bare estimators.
Scale inside the pipeline, not before the search
If you scale the whole dataset once and then run GridSearchCV, every
cross-validation fold has been scaled using statistics that include its own
validation rows — a subtle leak that inflates best_score_. Putting the
scaler inside the Pipeline lets each fold scale itself correctly. The
pipelines chapter returns to this; for now, the rule is: preprocessing goes
inside the thing you cross-validate.
When the grid is too big: RandomizedSearchCV
A full grid search tries every combination, which is exhaustive but explodes combinatorially. Four hyperparameters with five values each is 5⁴ = 625 combinations, times 5 folds = 3,125 fits. That is slow, and much of it is wasted on combinations that were never going to win.
RandomizedSearchCV is the pragmatic alternative for large spaces: instead
of trying everything, it samples a fixed number of random combinations
(n_iter) from the ranges you specify. Surprisingly often, a few dozen
random samples find a setting nearly as good as the exhaustive grid, in a
fraction of the time, because usually only one or two hyperparameters
really matter and random sampling explores their values well.
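A minimal sketch of a randomized search follows. The random forest, the integer ranges drawn with `scipy.stats.randint`, `n_iter=10`, and `cv=3` are all assumptions chosen to keep the example quick; they are not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Ranges to sample from, instead of a fixed list of every value.
# randint(low, high) draws integers from low up to high - 1.
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 10),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=10,       # try only 10 sampled combinations
    cv=3,
    random_state=0,  # makes the sampled combinations reproducible
)
search.fit(X_train, y_train)

print("combinations tried:", len(search.cv_results_["params"]))
print("best params       :", search.best_params_)
print("best CV score     :", round(search.best_score_, 3))
```

The full grid over these ranges would have 80 × 8 × 9 = 5,760 combinations; `n_iter` caps the work at 10 of them.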
Grid vs. random, a rule of thumb
Reach for GridSearchCV when the search space is small — a handful of
combinations you want to check exhaustively. Reach for
RandomizedSearchCV when the space is large and a full grid would be
too slow; cap the work with n_iter and let it sample. Both use
cross-validation and expose the same best_params_ / best_score_ /
best_estimator_.
We keep this course's examples on GridSearchCV with tiny grids so they
run instantly, but reach for the randomized version the moment your grid
grows past a few dozen combinations.
The misconception that ruins careful tuning
Here is the trap that catches even experienced practitioners. You run a
search, get a great best_score_, and you are tempted to keep going:
nudge the grid, add a value, re-run, watch the number climb. Each
improvement feels real. It is not — or at least, not entirely.
When you optimize hard against the validation signal (whether a single validation set or the cross-validation folds), you start fitting the noise in that particular data, exactly as a flexible model overfits its training set. The validation score creeps up while the true performance on genuinely new data stalls or even drops. You have overfit the validation set.
This is precisely why the final test set must stay sealed through the
entire tuning process. The cross-validated best_score_ can be
optimistic — it is the number you chose to maximize — and only a test set
that played no part in any decision can tell you how much of your
improvement was real.
A rising validation score can lie
The more hyperparameter settings you try, the more chances you give random
noise to look like signal. A best_score_ that keeps improving as you
expand the grid is partly genuine and partly luck, and you cannot tell the
ratio from the validation score alone. Keep a final test set you evaluate
once. If its number is much worse than best_score_, you overfit the
validation data.
When NOT to bother tuning much
Hyperparameter tuning is not free — it costs compute, time, and a little of your sanity — and it is not always worth it.
- A simple model with few hyperparameters. LinearRegression has almost nothing to tune. Do not invent a grid for a model that has no meaningful knobs.
- When the default is already excellent. scikit-learn's defaults are thoughtfully chosen. Always establish the default-settings baseline first; if it already meets your needs, elaborate tuning is wasted effort.
- When better data would help more. A few more relevant features, or cleaner labels, routinely beat squeezing the last 0.3% out of max_depth. Tuning is a polish step, not a rescue.
- When you cannot afford the search honestly. If your dataset is so small that you cannot spare both cross-validation folds and a real test set, an aggressive search will overfit the validation data. Tune lightly, or not at all.
Baseline first, tune second
Before any tuning, fit the model with default hyperparameters and record its cross-validated score. That baseline tells you whether tuning is even worth doing, and by how much it actually helped. Tuning that beats the default by a hair is rarely worth the added complexity and the extra risk of overfitting the validation set.
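One way to record that baseline is sketched below: score the default model with cross-validation, then run a small search and compare. The decision tree, the grid, and the seeds are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: baseline with default hyperparameters (random_state only
# fixes the tie-breaking so the run is reproducible).
baseline_score = cross_val_score(
    DecisionTreeClassifier(random_state=0), X_train, y_train, cv=5
).mean()

# Step 2: a small search, judged the same way.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 3, 5]},
    cv=5,
)
search.fit(X_train, y_train)

gain = search.best_score_ - baseline_score
print(f"baseline CV score: {baseline_score:.3f}")
print(f"tuned CV score   : {search.best_score_:.3f}")
print(f"gain from tuning : {gain:+.3f}")
```

A small gain here is a hint that the defaults were already adequate and that the extra search may not be worth keeping.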
Real-world applications
Hyperparameter tuning runs quietly inside almost every deployed model. A fraud-detection team tunes the regularization strength of a logistic model to trade false alarms against missed fraud. A demand-forecasting team sweeps tree depth and the number of trees to balance accuracy against training cost. A medical-imaging group cross-validates the neighborhood size of a nearest-neighbor classifier so it generalizes across hospitals. In every case the pattern is the same as the tiny examples above: define a small space of settings, let cross-validation choose, and confirm the choice on data that was never part of the decision.
Your turn
You will tune a single hyperparameter of a decision tree on the wine dataset, honestly.
1. The test set is already split off for you (X_train, X_test, y_train, y_test). Do not touch the test set until step 5.
2. Build a param_grid dict that tries max_depth values 2, 3, and 4. Name the key exactly "max_depth".
3. Create a GridSearchCV called search around DecisionTreeClassifier(random_state=0), passing your param_grid and cv=5. Fit it on the training data only.
4. Read search.best_params_ into best_params and search.best_score_ into best_cv_score.
5. Evaluate search.best_estimator_ on the test set with .score(...) and store that in test_score.
The hidden tests check that the grid was the right size, that
best_params chose a depth from your grid, that best_cv_score is a
sensible accuracy, and that you produced a test_score.
Check your understanding
Which of the following is a hyperparameter rather than a parameter?
- The slope coefficient that LinearRegression computes from the data
- The n_neighbors value you pass to KNeighborsClassifier
- The split thresholds a decision tree learns at each node
- The coefficients a logistic regression fits to each feature

Why can a model not simply learn its own best hyperparameters from the training data during .fit()?
- Because scikit-learn forbids it for licensing reasons
- Because hyperparameters must always be integers, which cannot be optimized
- Because a hyperparameter like flexibility, chosen to maximize the training score, would always pick maximum complexity and overfit
- Because hyperparameters have no effect on the trained model

After search.fit(X_train, y_train), what does search.best_score_ represent?
- The accuracy of the best model on the held-out test set
- The training accuracy of the best model
- The mean cross-validated score of the winning hyperparameter setting on the training data
- A random number used to seed the search

You tune a Pipeline whose steps are named "scaler" and "model", and you want to search the C of the logistic regression in the "model" step. Which grid key is correct?
- "C"
- "model.C"
- "model__C"
- "scaler__C"

You keep expanding your grid and re-running the search, and best_score_ slowly climbs. Why is it still essential to keep a final untouched test set?
- Because best_score_ is always exactly equal to the test score
- Because optimizing hard against the validation signal can overfit it, making best_score_ optimistic; only a test set used in no decision reveals true performance
- Because the test set trains the final model
- Because grids must contain a prime number of values