Dataslope

Cross-Validation

One train/test split gives one noisy estimate. Cross-validation averages several, turning a lucky-or-unlucky number into a much more stable one.

The train/test split taught us to evaluate on held-out data. But it has a quiet weakness: the score you get depends on which rows happened to land in the test set. Shuffle differently and the number changes. On small datasets that wobble can be large enough to make you pick the wrong model. Cross-validation fixes this by not betting everything on a single split.

The problem: one split is one roll of the dice

Let us prove the wobble is real. We will evaluate the same model on the same data, changing only the random seed of the split.
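The original interactive example is not reproduced here, so the following is only a minimal sketch of the experiment just described. It assumes the breast cancer dataset bundled with scikit-learn and a decision tree as the model; both are illustrative choices, and the exact numbers will depend on your scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: dataset and model are stand-ins chosen for illustration.
X, y = load_breast_cancer(return_X_y=True)

split_scores = []
for seed in range(5):
    # Only the split seed changes; the data and the model settings stay fixed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    split_scores.append(acc)
    print(f"split seed={seed}  accuracy={acc:.3f}")

print(f"range across seeds: {max(split_scores) - min(split_scores):.3f}")
```

The last line prints the gap between the luckiest and unluckiest split, which is the wobble in question.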


Nothing changed but the luck of the draw, yet the accuracy swings by several percentage points. If you had run one split and reported its number as "the" accuracy, you might have been several points too optimistic or too pessimistic — and you would never have known. Which split is the "true" one? None of them. The truth is somewhere in the middle, and a single split cannot tell you where.

A single split hides its own uncertainty

One train/test split gives you a point estimate with no error bars. With a few hundred rows, that estimate can easily be off by several points. Model comparisons made on a single split are notoriously unreliable.

The fix: k-fold cross-validation

Instead of one split, make several and average. K-fold cross-validation divides the data into k roughly equal parts ("folds"). It then runs k rounds: in each round, one fold is the test set and the other k-1 folds together form the training set. Every row gets to be in the test set exactly once.

The average of the k scores is your performance estimate, and their spread tells you how uncertain it is. Because every row is used for testing once and for training k-1 times, you squeeze far more signal out of a small dataset than a single split allows.
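To make the mechanics concrete, here is a hand-written k-fold loop. It is a sketch under assumptions: the dataset, the decision tree, and the variable names (fold_scores, times_tested) are illustrative and not taken from the original page.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Assumption: dataset and model are stand-ins chosen for illustration.
X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
times_tested = np.zeros(len(X), dtype=int)  # how often each row is a test row

for train_idx, test_idx in kf.split(X):
    # A fresh model per round: train on k-1 folds, score on the held-out fold.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    times_tested[test_idx] += 1

print("fold scores:", np.round(fold_scores, 3))
print(f"mean={np.mean(fold_scores):.3f}  std={np.std(fold_scores):.3f}")
print("every row tested exactly once:", bool((times_tested == 1).all()))
```

The times_tested counter is there only to confirm the claim that each row lands in a test fold exactly once.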

cross_val_score: one line to do it all

scikit-learn wraps the whole procedure in a single function.
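The original code cell is missing from this copy, so what follows is a minimal sketch of a typical call rather than the author's exact example. The dataset and model are assumptions; cross_val_score and its cv parameter are the real scikit-learn API.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: dataset and model are stand-ins chosen for illustration.
X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# cv=5 asks for 5 folds; the result is an array with one score per fold.
scores = cross_val_score(model, X, y, cv=5)

print("fold scores:", scores.round(3))
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```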


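That mean is a far more trustworthy estimate than any single split, and the standard deviation across folds is the error bar you were missing. It is a rough error bar, because the folds share training data and their scores are not fully independent, but it is far better than none. Report a cross-validated result as mean plus or minus standard deviation, never as a single bare number pretending to be exact.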

Stratification comes free for classifiers

When you pass a classifier to cross_val_score with an integer cv (or leave cv at its default), scikit-learn uses stratified k-fold for binary and multiclass targets, preserving each class's proportion in every fold. You get the benefit of stratify=y automatically. This default does not shuffle the rows; for shuffling and explicit control, pass cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0).
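One way to see stratification at work is to compare the positive-class rate in each test fold with the rate in the full dataset. This is a sketch only; the dataset and the names overall_rate and fold_rates are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

# Assumption: dataset is a stand-in chosen for illustration (labels are 0/1).
X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

overall_rate = y.mean()  # fraction of rows with label 1 in the whole dataset
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

print(f"overall positive rate: {overall_rate:.3f}")
print("per-fold positive rate:", np.round(fold_rates, 3))
```

The per-fold rates should sit very close to the overall rate, which is what preserving class proportions means in practice.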

Choosing k

k=5 and k=10 are the standard choices. The tradeoff:

  • Larger k → each training set is bigger (closer to using all the data), so the estimate has less bias, but you train more times (slower) and the training sets overlap more, which makes the fold scores more correlated.
  • Smaller k → faster, but each training set is smaller, so the estimate can be slightly pessimistic.

The extreme case k = n (one row per fold) is leave-one-out cross-validation. It uses almost all the data for every fit but requires n fits and can be noisy. For most work, 5 or 10 folds is the sweet spot.
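The cost side of the tradeoff can be read straight off the splitters without fitting anything. The sketch below assumes the breast cancer dataset and a hypothetical train_sizes dictionary; it counts fits and training rows only and says nothing about accuracy.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, LeaveOneOut

# Assumption: dataset is a stand-in chosen for illustration.
X, y = load_breast_cancer(return_X_y=True)
n = len(X)

train_sizes = {}
for k in (2, 5, 10):
    # Look at the first round only; other rounds differ by at most one row.
    train_idx, test_idx = next(KFold(n_splits=k).split(X))
    train_sizes[k] = len(train_idx)
    print(f"k={k:>2}: {k} fits, about {len(train_idx)} training rows per fit")

loo = LeaveOneOut()
n_loo_fits = loo.get_n_splits(X)
print(f"leave-one-out: {n_loo_fits} fits, {n - 1} training rows per fit")
```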

The leakage trap CV makes obvious

Here is a subtle and dangerous mistake. If you scale your features using the whole dataset before cross-validating, every fold's "test" portion has already influenced the scaler — information has leaked from test to train, and your CV score is optimistic. The fix is to put preprocessing inside the cross-validation, so it is re-fit from scratch on each fold's training data. A Pipeline does this automatically.
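The original code cell is missing from this copy, so this is a sketch of the contrast just described rather than the author's example. The dataset and the names leaky_scores and honest_scores are assumptions. With plain scaling on a dataset like this one the two results may be nearly identical; the gap tends to grow with steps such as feature selection, so read the code for the pattern, not for the size of the difference.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: dataset is a stand-in chosen for illustration.
X, y = load_breast_cancer(return_X_y=True)

# Leaky: the scaler sees every row, including rows later used as test folds.
X_scaled_all = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(
    LogisticRegression(max_iter=5000), X_scaled_all, y, cv=5
)

# Honest: the scaler is re-fit inside each round on that round's training rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
honest_scores = cross_val_score(pipe, X, y, cv=5)

print(f"scaled before CV: {leaky_scores.mean():.3f} +/- {leaky_scores.std():.3f}")
print(f"scaled inside CV: {honest_scores.mean():.3f} +/- {honest_scores.std():.3f}")
```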


Because the StandardScaler lives inside the pipeline, each fold scales using only its own training portion. The cross-validation now faithfully mimics what happens at deployment, where you must scale new data using statistics learned in the past. We will build pipelines properly in their own chapter — for now, the lesson is: anything that learns from data belongs inside the cross-validation loop.

Preprocess inside, not before

Fitting a scaler, imputer, or feature selector on the full dataset before cross-validating leaks the test folds and inflates your score. Wrap preprocessing and model together in a Pipeline so every fold is honest.

Cross-validation does not replace the test set

A common confusion: "If I cross-validate, do I still need a test set?" For the most rigorous estimate, yes. Here is the reasoning. You typically use cross-validation to choose things — which model, which hyperparameters. The moment you select the option with the best CV score, that score is slightly optimistic, because you picked the winner of a contest with some luck in it. A final, untouched test set gives you one clean number after all the choosing is done.

Cross-validation is for making decisions with a stable signal. The test set is for the final, honest report after the decisions are made.
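A sketch of that division of labour, under stated assumptions: the two candidate models, the 80/20 split, and the names X_dev, cv_means, and best_name are all illustrative. The test rows are set aside first, cross-validation picks a winner using only the remaining rows, and the test set is scored once at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: dataset and candidates are stand-ins chosen for illustration.
X, y = load_breast_cancer(return_X_y=True)

# Step 1: lock away a test set before any decisions are made.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: use cross-validation on the development rows to choose a model.
candidates = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)),
    "tree": DecisionTreeClassifier(random_state=0),
}
cv_means = {
    name: cross_val_score(model, X_dev, y_dev, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Step 3: refit the winner on all development rows, then score the test set once.
final_model = candidates[best_name].fit(X_dev, y_dev)
test_score = final_model.score(X_test, y_test)

print("CV means:", {name: round(m, 3) for name, m in cv_means.items()})
print(f"chosen: {best_name}  final test accuracy: {test_score:.3f}")
```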

When cross-validation needs care

Plain k-fold assumes rows are independent and interchangeable. That is the same assumption the random split makes, and it fails in the same places:

  • Time series. Use TimeSeriesSplit, which always trains on the past and tests on the future, never the reverse.
  • Grouped data (repeated patients, users, devices). Use GroupKFold so all of a group's rows stay together in the same fold; otherwise the model recognizes the individual across folds and your score is fantasy.
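Both splitters can be inspected on a toy array to check that they behave as described. The sketch below uses made-up data: 12 rows in time order, and a hypothetical groups array standing for four patients with three rows each.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Toy data, invented for illustration: 12 rows already sorted by time.
X = np.arange(12).reshape(-1, 1)

# TimeSeriesSplit: every training index comes before every test index.
time_ok = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "test:", test_idx)
    time_ok.append(train_idx.max() < test_idx.min())

# GroupKFold: hypothetical group labels, 4 patients with 3 rows each.
groups = np.repeat([0, 1, 2, 3], 3)
group_ok = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, groups=groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    print("test groups:", sorted(set(groups[test_idx])), "shared with train:", shared)
    group_ok.append(len(shared) == 0)

print("past-only training in every round:", all(time_ok))
print("no group on both sides in any round:", all(group_ok))
```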

Independence again

If a random train/test split would cheat on your data, so will plain k-fold. Match the splitter to the structure of your problem: TimeSeriesSplit for time, GroupKFold for grouped data.

Common misconceptions

  • "Cross-validation prevents overfitting." It does not change how a model fits — it measures generalization more reliably so you can detect overfitting and choose better. The fixing is still up to you.
  • "More folds is always better." Beyond 10, you pay a lot of compute for diminishing returns, and leave-one-out can be noisy.
  • "The CV mean is the exact production accuracy." It is a better estimate than a single split, but still an estimate with a standard deviation. Report the spread.
  • "I can cross-validate, then keep tweaking until the CV score is great." Tune too obsessively against the CV score and you start overfitting it. That is why a final held-out test set exists.

Real-world applications

Cross-validation is a standard evaluation protocol in applied ML projects and a staple of how Kaggle competitors validate locally before submitting. When a paper claims a model is better, the credible ones back it with cross-validated scores and standard deviations, not a single lucky split. Whenever data is scarce, as in medical studies or niche business problems, cross-validation is what lets a few hundred rows yield a trustworthy estimate.

Your turn

Challenge
Cross-validate a pipeline

Using the breast cancer dataset (X, y provided):

  1. Build a pipeline pipe with make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000)).
  2. Run 10-fold cross-validation with cross_val_score, storing the array of fold scores in cv_scores.
  3. Store the mean in mean_score and the standard deviation in std_score.

The tests check that cv_scores has 10 entries, that mean_score matches cv_scores.mean(), and that the mean is a strong accuracy (above 0.9) — which it should be for this dataset.

Check your understanding

Question (select one)

What is the main problem with evaluating a model on a single train/test split, especially on a small dataset?

It is slower than cross-validation

It always overestimates accuracy

The score depends on which rows happened to land in the test set, so it is a noisy estimate with no error bar

It cannot be used with classifiers

Question (select one)

In 5-fold cross-validation, how many times is each row used for testing?

Five times

Zero times

Exactly once — each row sits in the test fold in exactly one of the five rounds

It depends on the random seed

Question (select one)

Why should preprocessing like StandardScaler be placed inside a Pipeline that is cross-validated, rather than applied to the whole dataset first?

It runs faster inside a pipeline

Pipelines are required by cross_val_score

So the scaler is re-fit on each fold's training portion only, preventing information from the test fold leaking into training

It changes the number of folds

Question (select one)

After using cross-validation to choose your model and hyperparameters, why keep a separate untouched test set?

Because cross-validation cannot score classifiers

Because selecting the option with the best CV score makes that score slightly optimistic; a final untouched test set gives one clean, unbiased number

Because the test set trains the final model

Because cross-validation uses the test set internally

Question (select one)

You are predicting next month's revenue from monthly history. Why is plain k-fold cross-validation inappropriate?

k-fold cannot be used for regression

The data is ordered in time; plain k-fold would train on future months to predict past ones, which is impossible in reality — use TimeSeriesSplit

k-fold requires at least 100 folds for time series

Revenue data cannot be cross-validated
