Cross-Validation
One train/test split gives one noisy estimate. Cross-validation averages many, turning a lucky-or-unlucky number into a reliable one.
The train/test split taught us to evaluate on held-out data. But it has a quiet weakness: the score you get depends on which rows happened to land in the test set. Shuffle differently and the number changes. On small datasets that wobble can be large enough to make you pick the wrong model. Cross-validation fixes this by not betting everything on a single split.
The problem: one split is one roll of the dice
Let us prove the wobble is real. We will evaluate the same model on the same data, changing only the random seed of the split.
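A minimal sketch of that experiment. The dataset (scikit-learn's built-in breast cancer data) and the model (a decision tree) are assumptions chosen for illustration; the only thing that varies between runs is the random_state of the split. Exact numbers will depend on your scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed stand-in dataset and model; any classifier would make the same point.
X, y = load_breast_cancer(return_X_y=True)

split_scores = []
for seed in range(5):
    # Only the seed of the split changes from one iteration to the next.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))
    print(f"seed={seed}  accuracy={split_scores[-1]:.3f}")

# Gap between the luckiest and unluckiest split.
print(f"range across seeds: {max(split_scores) - min(split_scores):.3f}")
```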
Nothing changed but the luck of the draw, yet the accuracy swings by several percentage points. If you had run one split and reported its number as "the" accuracy, you might have been several points too optimistic or too pessimistic — and you would never have known. Which split is the "true" one? None of them. The truth is somewhere in the middle, and a single split cannot tell you where.
A single split hides its own uncertainty
One train/test split gives you a point estimate with no error bars. With a few hundred rows, that estimate can easily be off by several points. Model comparisons made on a single split are notoriously unreliable.
The fix: k-fold cross-validation
Instead of one split, make several and average. K-fold cross-validation
divides the data into k roughly equal parts ("folds"). It then runs k rounds: in
each round, one fold is the test set and the other k-1 folds are the
training set. Every row gets to be in the test set exactly once.
The average of the k scores is your performance estimate, and their
spread tells you how uncertain it is. Because every row is used for testing
once and for training k-1 times, you squeeze far more signal out of a
small dataset than a single split allows.
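The procedure can be written out by hand as a rough sketch, assuming the breast cancer dataset and a decision tree as placeholders. KFold yields the row indices for each round; the counter times_tested is a hypothetical helper added here only to confirm that every row is tested exactly once.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
times_tested = np.zeros(len(X), dtype=int)  # how often each row lands in a test fold

for train_idx, test_idx in kf.split(X):
    # Train on k-1 folds, score on the one fold held out this round.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    times_tested[test_idx] += 1

print("fold scores:", np.round(fold_scores, 3))
print(f"mean={np.mean(fold_scores):.3f}  std={np.std(fold_scores):.3f}")
print("every row tested exactly once:", bool(np.all(times_tested == 1)))
```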
cross_val_score: one line to do it all
scikit-learn wraps the whole procedure in a single function.
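A short sketch of the call, again assuming the breast cancer dataset and a decision tree as illustrative choices. cross_val_score returns one score per fold as a NumPy array, so the mean and standard deviation are one method call each.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# cv=5 requests 5-fold cross-validation; the result holds one accuracy per fold.
scores = cross_val_score(model, X, y, cv=5)

print("fold scores:", scores.round(3))
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```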
That mean is a far more trustworthy estimate than any single split, and the standard deviation is the error bar you were missing. The right way to report a cross-validated result is mean plus or minus standard deviation — never a single bare number pretending to be exact.
Stratification comes free for classifiers
When you pass a classifier to cross_val_score with an integer cv (or leave
cv at its default), scikit-learn uses stratified k-fold, preserving each
class's proportion in every fold. You get the benefit of stratify=y
automatically. Note that this default does not shuffle the rows. For explicit
control, pass cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0).
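To see what stratification buys you, the sketch below compares the share of the positive class in each test fold with the share in the full dataset. The dataset is an assumed example, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
overall_rate = y.mean()  # fraction of rows labelled 1 in the whole dataset

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Fraction of rows labelled 1 inside each test fold.
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

print(f"overall positive rate: {overall_rate:.3f}")
print("per-fold positive rate:", np.round(fold_rates, 3))
```

The same cv object can be passed straight to cross_val_score through its cv argument.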
Choosing k
k=5 and k=10 are the standard choices. The tradeoff:
- Larger k → each training set is bigger (closer to using all the data), so the estimate has less bias, but you train more times (slower) and the training sets overlap more.
- Smaller k → faster, but each training set is smaller, so the estimate can be slightly pessimistic.
The extreme case k = n (one row per fold) is leave-one-out
cross-validation. It uses almost all the data for every fit but requires n
fits and can be noisy. For most work, 5 or 10 folds is the sweet spot.
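One way to get a feel for the tradeoff is to run the same model with a few values of k and compare training-set size, runtime, and the resulting estimate. This is only a sketch under assumed choices (breast cancer data, a decision tree, k in 3, 5, 10); timings depend on your machine.

```python
import time

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

results = {}
for k in (3, 5, 10):
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=k)
    elapsed = time.perf_counter() - start
    results[k] = scores
    # Each round trains on roughly (k-1)/k of the rows.
    approx_train_rows = len(X) * (k - 1) // k
    print(
        f"k={k:2d}  train rows ~{approx_train_rows}  "
        f"mean={scores.mean():.3f}  std={scores.std():.3f}  time={elapsed:.2f}s"
    )
```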
The leakage trap CV makes obvious
Here is a subtle and dangerous mistake. If you scale your features using the
whole dataset before cross-validating, every fold's "test" portion has
already influenced the scaler — information has leaked from test to train,
and your CV score is optimistic. The fix is to put preprocessing inside
the cross-validation, so it is re-fit from scratch on each fold's training
data. A Pipeline does this automatically.
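The sketch below puts the leaky and the honest approach side by side, assuming the breast cancer dataset and logistic regression as the example model. For plain scaling on this dataset the gap between the two numbers may be small; the leak tends to matter more with steps such as feature selection, so treat this as an illustration of the pattern rather than of the size of the effect.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler sees every row, including rows that later act as test folds.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=5000), X_scaled, y, cv=5)

# Honest: the scaler is re-fit on each fold's training portion only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
honest_scores = cross_val_score(pipe, X, y, cv=5)

print(f"scaled before CV : {leaky_scores.mean():.3f} +/- {leaky_scores.std():.3f}")
print(f"scaled inside CV : {honest_scores.mean():.3f} +/- {honest_scores.std():.3f}")
```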
Because the StandardScaler lives inside the pipeline, each fold scales
using only its own training portion. The cross-validation now faithfully
mimics what happens at deployment, where you must scale new data using
statistics learned in the past. We will build pipelines properly in their
own chapter — for now, the lesson is: anything that learns from data
belongs inside the cross-validation loop.
Preprocess inside, not before
Fitting a scaler, imputer, or feature selector on the full dataset before
cross-validating leaks the test folds and inflates your score. Wrap
preprocessing and model together in a Pipeline so every fold is honest.
Cross-validation does not replace the test set
A common confusion: "If I cross-validate, do I still need a test set?" For the most rigorous estimate, yes. Here is the reasoning. You typically use cross-validation to choose things — which model, which hyperparameters. The moment you select the option with the best CV score, that score is slightly optimistic, because you picked the winner of a contest with some luck in it. A final, untouched test set gives you one clean number after all the choosing is done.
Cross-validation is for making decisions with a stable signal. The test set is for the final, honest report after the decisions are made.
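That division of labour might look like the following sketch. The two candidate models, the 80/20 split, and the names X_dev and X_test are assumptions made for illustration: cross-validation runs only on the development portion, and the test portion is scored once at the very end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Set the final test set aside before any choosing happens.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical candidates to choose between.
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "tree": DecisionTreeClassifier(random_state=0),
}

# Decide using cross-validation on the development data only.
cv_means = {
    name: cross_val_score(est, X_dev, y_dev, cv=5).mean()
    for name, est in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Refit the winner on all development data, then touch the test set once.
final_model = candidates[best_name].fit(X_dev, y_dev)
test_score = final_model.score(X_test, y_test)

print("CV means:", {k: round(v, 3) for k, v in cv_means.items()})
print(f"chosen: {best_name}  final test accuracy: {test_score:.3f}")
```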
When cross-validation needs care
Plain k-fold assumes rows are independent and interchangeable. That is the same assumption the random split makes, and it fails in the same places:
- Time series. Use TimeSeriesSplit, which always trains on the past and tests on the future, never the reverse.
- Grouped data (repeated patients, users, devices). Use GroupKFold so all of a group's rows stay together in the same fold; otherwise the model recognizes the individual across folds and your score is fantasy.
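Both splitters can be inspected on a toy example. The sketch below uses made-up data (12 time-ordered rows and six hypothetical patients with two rows each) purely to print which rows end up where; the ts_ok and group_ok lists are illustrative checks, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 rows, assumed to be in time order

# TimeSeriesSplit: every training index comes before every test index.
ts_ok = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    ts_ok.append(train_idx.max() < test_idx.min())
    print("time   train:", train_idx, " test:", test_idx)

# GroupKFold: no group appears on both sides of a split.
groups = np.repeat([0, 1, 2, 3, 4, 5], 2)  # six hypothetical patients, two rows each
group_ok = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, groups=groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    group_ok.append(len(shared) == 0)
    print("groups train:", sorted(set(groups[train_idx])),
          " test:", sorted(set(groups[test_idx])))
```

With cross_val_score, a TimeSeriesSplit object goes in through cv, and group labels for GroupKFold are supplied through the groups argument.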
Independence again
If a random train/test split would cheat on your data, so will plain k-fold.
Match the splitter to the structure of your problem: TimeSeriesSplit for
time, GroupKFold for grouped data.
Common misconceptions
- "Cross-validation prevents overfitting." It does not change how a model fits — it measures generalization more reliably so you can detect overfitting and choose better. The fixing is still up to you.
- "More folds is always better." Beyond 10, you pay a lot of compute for diminishing returns, and leave-one-out can be noisy.
- "The CV mean is the exact production accuracy." It is a better estimate than a single split, but still an estimate with a standard deviation. Report the spread.
- "I can cross-validate, then keep tweaking until the CV score is great." Tune too obsessively against the CV score and you start overfitting it. That is why a final held-out test set exists.
Real-world applications
Cross-validation is the default evaluation protocol in nearly every applied ML project and every Kaggle competition. When a paper claims a model is better, the credible ones back it with cross-validated scores and standard deviations, not a single lucky split. Whenever data is scarce — medical studies, niche business problems — cross-validation is what lets a few hundred rows yield a trustworthy estimate.
Your turn
Using the breast cancer dataset (X, y provided):
- Build a pipeline pipe with make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)).
- Run 10-fold cross-validation with cross_val_score, storing the array of fold scores in cv_scores.
- Store the mean in mean_score and the standard deviation in std_score.
The tests check that cv_scores has 10 entries, that mean_score
matches cv_scores.mean(), and that the mean is a strong accuracy
(above 0.9) — which it should be for this dataset.
Check your understanding
What is the main problem with evaluating a model on a single train/test split, especially on a small dataset?
- It is slower than cross-validation
- It always overestimates accuracy
- The score depends on which rows happened to land in the test set, so it is a noisy estimate with no error bar
- It cannot be used with classifiers
In 5-fold cross-validation, how many times is each row used for testing?
- Five times
- Zero times
- Exactly once: each row sits in the test fold in exactly one of the five rounds
- It depends on the random seed
Why should preprocessing like StandardScaler be placed inside a Pipeline that is cross-validated, rather than applied to the whole dataset first?
- It runs faster inside a pipeline
- Pipelines are required by cross_val_score
- So the scaler is re-fit on each fold's training portion only, preventing information from the test fold leaking into training
- It changes the number of folds
After using cross-validation to choose your model and hyperparameters, why keep a separate untouched test set?
- Because cross-validation cannot score classifiers
- Because selecting the option with the best CV score makes that score slightly optimistic; a final untouched test set gives one clean, unbiased number
- Because the test set trains the final model
- Because cross-validation uses the test set internally
You are predicting next month's revenue from monthly history. Why is plain k-fold cross-validation inappropriate?
- k-fold cannot be used for regression
- The data is ordered in time; plain k-fold would train on future months to predict past ones, which is impossible in reality. Use TimeSeriesSplit instead
- k-fold requires at least 100 folds for time series
- Revenue data cannot be cross-validated