The Train/Test Split
Why we hide data from our own models — the single most important habit in machine learning, and the foundation of every honest evaluation.
If you remember one idea from this entire course, make it this one: a model must be judged on data it has never seen. Everything else — overfitting, cross-validation, every metric in the evaluation chapters — is a refinement of that single sentence.
This page is about the simplest way to honor it: the train/test split.
The problem it solves
Imagine a student who is given the exact questions and answers for an exam the night before. The next day they score 100%. Did they learn the material? You have no idea. The score is meaningless because they were tested on the very thing they memorized.
Machine learning models can memorize too. A flexible enough model can essentially store the training data and replay it. If you then "test" it on that same data, it looks brilliant — and tells you nothing about how it will do tomorrow on a customer, a patient, or a transaction it has never encountered.
The cardinal sin
Evaluating a model on the same data it was trained on is the single most common way beginners fool themselves. The number you get is not performance — it is a measure of memory. We need a number that predicts the future, not one that flatters the past.
The fix: hold some data back
Before training, we randomly set aside a portion of the data — the test set — and lock it in a drawer. The model never sees it during training. We train on the rest (the training set), and only at the very end do we unlock the drawer and measure performance on those held-out rows.
Because the test rows played no part in training, the model's score on them is a fair stand-in for how it will behave on genuinely new data. That is the whole trick.
train_test_split in practice
scikit-learn gives us one function for this: train_test_split, from sklearn.model_selection. It shuffles the rows and cuts them into two groups.
Four things come back, always in this order: X_train, X_test, y_train, y_test. The features and their labels are split together, row for row,
so y_train still lines up with X_train.
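A minimal sketch of the call, assuming the iris dataset as a stand-in (the dataset, test_size=0.2, and random_state=0 are choices made here for illustration only):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris is only a stand-in dataset: 150 rows, 4 features.
X, y = load_iris(return_X_y=True)

# Hold back 20% of the rows as the test set; the rest is for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)
print(y_train.shape, y_test.shape)  # (120,) (30,)
```

The row counts of X_train and y_train match, as do those of X_test and y_test, because features and labels are shuffled with the same permutation.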
X and y, a quick reminder
By convention X (capital) is the feature matrix — one row per example,
one column per feature — and y (lowercase) is the target we want to
predict. We will use this naming on every page.
Seeing the trap with your own eyes
Let us make the danger concrete. We will train a deliberately over-flexible decision tree, then score it two ways: on the training data it memorized, and on the held-out test data. Watch the gap.
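One way to run this experiment is sketched below. The synthetic dataset from make_classification, its noise level (flip_y=0.1), and the 25% test size are assumptions made for illustration; exact numbers will differ on other data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data. flip_y=0.1 assigns random labels to about 10% of
# the rows, which gives the tree noise to memorize.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.1, random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# No depth limit: the tree keeps splitting until the training rows are
# classified as well as it possibly can.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # scored on rows the tree has seen
test_acc = tree.score(X_test, y_test)     # scored on held-out rows
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```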
The training score is a perfect (or near-perfect) 1.000. If that were
our report card, we would declare victory. But the test score — the one
that actually matters — is noticeably lower. That gap is overfitting,
and the only reason we can see it at all is that we held data back.
A perfect training score is a red flag, not a trophy
When a model scores 100% on its training data, the right reaction is suspicion, not celebration. It usually means the model has memorized noise that will not repeat on new data. The test score is your reality check.
random_state: making the split reproducible
train_test_split shuffles before cutting, so without a fixed seed you
would get a different split every run — and a slightly different score each
time. Passing random_state=0 (any fixed integer works) freezes the
shuffle so your results are reproducible and you can compare models fairly.
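A small check of this behavior, using a toy array of ten rows (an assumption chosen so the shuffled rows are easy to read):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows whose single feature is just the row index.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Same seed twice: the shuffle is identical, so the test rows match.
_, X_test_a, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
_, X_test_b, _, _ = train_test_split(X, y, test_size=0.3, random_state=0)
same_seed_matches = np.array_equal(X_test_a, X_test_b)

# A different seed draws a different shuffle, so the test rows usually differ.
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.3, random_state=1)

print(same_seed_matches)  # True
print(X_test_a.ravel(), X_test_c.ravel())
```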
Same seed, fair comparison
When comparing two models, split once with a fixed random_state (or pass
the same random_state to every split) so both are trained and tested on
exactly the same rows. Otherwise you might be comparing models, or you
might just be comparing two lucky draws, and you would not be able to
tell which.
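A sketch of such a comparison, assuming the breast cancer dataset and two arbitrary models picked only as examples. The split happens once, and both models reuse the same four arrays:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Split once, with a fixed seed, and reuse the same arrays for every model.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "5 nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                  # identical training rows
    scores[name] = model.score(X_test, y_test)   # identical test rows
    print(f"{name}: {scores[name]:.3f}")
```

Because the rows are identical for both models, a difference between the two scores cannot come from the split itself.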
stratify: keeping the class balance honest
For classification, a purely random split can get unlucky and put, say, most of the rare class into the test set. Then the training set barely contains examples of it. The fix is stratified splitting: preserve each class's proportion in both the training set and the test set.
With stratify=y, if class 1 makes up 10% of the full data, it makes up 10%
of the test set as well (up to rounding). Use it whenever classes are
imbalanced.
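The effect can be checked directly. This sketch assumes made-up labels with 900 rows of class 0 and 100 rows of class 1; the features are placeholders, since only the label proportions matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 900 rows of class 0, 100 rows of class 1.
y = np.array([0] * 900 + [1] * 100)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

# Plain random split: the share of class 1 in the test set depends on the seed.
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: each class keeps its overall proportion in both parts.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The mean of 0/1 labels is the fraction of rows in class 1.
print(f"full data:        {y.mean():.3f}")
print(f"plain test set:   {y_test_plain.mean():.3f}")
print(f"stratified test:  {y_test_strat.mean():.3f}")
print(f"stratified train: {y_train_strat.mean():.3f}")
```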
How big should the test set be?
There is no magic number, but the tradeoff is intuitive:
- Bigger test set → a more precise estimate of performance, but less data to learn from, so the model itself may be weaker.
- Smaller test set → more training data, but a noisier estimate that swings with luck.
Common choices are test_size=0.2 or 0.25. With only a few hundred rows,
even 20% is a thin test set, and a single split becomes unreliable — which
is exactly the motivation for cross-validation, coming up in a few
pages.
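To see how much a single split can swing on a small dataset, one option is to repeat the split with different seeds and look at the spread of test scores. The sketch below assumes the 150-row iris dataset and a decision tree, both chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 rows, so a 20% test set has only 30

test_scores = []
for seed in range(20):
    # Only the split changes between runs; the model settings stay fixed.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    test_scores.append(tree.score(X_test, y_test))

print(f"min {min(test_scores):.3f}, max {max(test_scores):.3f}, "
      f"std {np.std(test_scores):.3f}")
```

With 30 test rows, each misclassified row moves the accuracy by about 0.033, so some spread between seeds is expected.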
A subtle leak to avoid
The split must happen before you do anything that learns from the data —
scaling, filling missing values, selecting features. If you scale using the
mean of the whole dataset and then split, information from the test set
has already leaked into training. Later, Pipeline will make doing this
correctly the path of least resistance.
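Done by hand, the leak-free order looks roughly like this. The sketch assumes StandardScaler as the preprocessing step and the breast cancer dataset as stand-in data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 1. Split first, before anything learns from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# 2. Learn the scaling statistics from the training rows only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Apply those same training statistics to the test rows (no refitting).
X_test_scaled = scaler.transform(X_test)

# The scaler's means come from the training rows, not the full dataset.
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
```

The key detail is that fit (or fit_transform) is only ever called on the training rows; the test rows only pass through transform.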
When a random split is the wrong tool
The plain random split assumes every row is independent and interchangeable. Sometimes that is false, and a random split quietly cheats:
- Time series. To predict the future you must train on the past and test on later data. A random split lets the model peek at the future to predict the past. Split by time instead.
- Grouped data. If the same patient, user, or device appears in many rows, a random split can put some of their rows in training and some in test. The model recognizes the individual, not the pattern. Split by group (scikit-learn has GroupShuffleSplit for this).
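Both alternatives can be sketched on toy data. The 100-day series, the 20 hypothetical patients, and the placeholder features and labels below are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Time series: rows are assumed to be sorted from oldest to newest.
n_days = 100
X_time = np.arange(n_days).reshape(-1, 1)  # placeholder feature: day index
cutoff = int(n_days * 0.8)
# Train on the first 80 days, test on the last 20.
X_past, X_future = X_time[:cutoff], X_time[cutoff:]

# Grouped data: 20 hypothetical patients with 5 rows each.
groups = np.repeat(np.arange(20), 5)
X_grouped = np.arange(len(groups)).reshape(-1, 1)  # placeholder features
y_grouped = groups % 2                             # placeholder labels

# GroupShuffleSplit assigns whole groups to one side or the other.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X_grouped, y_grouped, groups=groups))

# No patient appears on both sides of the split.
shared = set(groups[train_idx]) & set(groups[test_idx])
print(X_past.max() < X_future.min())  # True
print(len(shared))                    # 0
```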
Independence is an assumption, not a guarantee
A shuffled train_test_split gives an honest estimate only when rows are
independent. For temporal or grouped data, a naive random split produces
an optimistic score that collapses in production. Always ask: could a
test row share its secret with a training row?
Common misconceptions
- "More test data is always better." No — it is a tradeoff. Past a point you are starving the model to slightly sharpen an estimate.
- "The test score is the exact accuracy I will get in production." It is an estimate from one finite sample, with its own uncertainty. A different split gives a different number.
- "Once I've looked at the test set, I can keep tuning against it." The moment you make decisions based on the test score, it stops being unseen. Repeatedly tuning to the test set leaks it in slowly. That is what a separate validation set and cross-validation are for.
Real-world applications
Every credible deployed model rests on this habit. A bank estimating default risk, a hospital triaging scans, a streaming service ranking shows — all measure quality on held-out data first, because the cost of discovering overfitting after launch is measured in money, trust, or lives.
Your turn
The wine dataset has 178 samples in 3 classes.
- Load it with load_wine(return_X_y=True) into X and y.
- Split into train/test with 30% in the test set, random_state=42, and stratified on y.
- Train a DecisionTreeClassifier(random_state=42) on the training set.
- Store its training accuracy in train_acc and its test accuracy in test_acc (both via .score(...)).
The hidden tests check that the split is the right size, that it is
stratified, and that train_acc is higher than test_acc (the
overfitting gap).
Check your understanding
Why do we evaluate a model on a held-out test set instead of on the data it was trained on?
Because test sets are easier to compute metrics on
Because a model can memorize its training data, so its training score reflects memory rather than its ability to generalize to new data
Because scikit-learn requires it
Because the training data is usually corrupted
A decision tree scores 1.000 on the training data and 0.890 on the test data. What is the most reasonable interpretation?
The model is perfect and ready to ship
Something is broken; scores should be equal
The model has overfit — it memorized the training data, and the test score (0.890) is the honest estimate of real-world performance
The test set must be too small to trust at all
What does stratify=y do in train_test_split, and when does it matter most?
It sorts the rows by y before splitting
It removes the rare class to balance the data
It preserves each class's proportion in both the train and test sets, which matters most when classes are imbalanced
It scales the features to have equal variance
You are predicting tomorrow's sales from historical daily data. Why is a plain random train_test_split a poor choice here?
Random splits are slower on time-series data
The rows are ordered in time; a random split lets the model train on future days to predict past ones, which it can never do in reality
Time-series data cannot be used in scikit-learn
You must always use the entire dataset for training with time series
Why is fixing random_state to the same value useful when comparing two models?
It makes the models train faster
It guarantees both models reach 100% accuracy
It makes both models see exactly the same train/test rows, so any difference in scores reflects the models, not two different lucky splits
It changes the metric used to score the models