Your First Model, End to End
The complete machine learning workflow on one tiny dataset — load, split, train, evaluate, predict — so the shape of every future model becomes second nature.
You have seen the cardinal rule (judge a model on data it has never seen)
and you have seen train_test_split carve a dataset into a part to learn
from and a part to be honest with. Now we put the whole thing together and
build a real, working model from start to finish.
By the end of this page you will have trained a classifier on the famous iris flowers, asked it to label flowers it never saw, and measured exactly how often it was right. More importantly, you will have internalized a shape — load, split, train, evaluate — that every model in this course repeats without exception. Once that shape is automatic, learning a new algorithm becomes a matter of swapping one line.
What you already know going in
This page assumes you have read the welcome page and The Train/Test Split. We will not re-derive why we hold data back — we will just do it, correctly, as a habit. If the phrase "test set" feels fuzzy, revisit that page first; everything here builds on it.
The problem this workflow solves
You have a table of data and a question you want answered automatically. Given a flower's measurements, which species is it? Given a patient's labs, is this tumor malignant? Given a house's features, what will it sell for? Each of these is the same kind of task: learn a rule from labeled examples, then apply that rule to new, unlabeled cases.
The end-to-end workflow is the disciplined way to do that without
fooling yourself. Anyone can call .fit() and admire a number. The
workflow's job is to make sure the number you admire actually predicts the
future. It does that by enforcing one separation — the data you learn from
versus the data you grade yourself on — at every single step.
Why memorize a shape?
Algorithms come and go, but the workflow is invariant. A linear regression, a hundred-tree random forest, and a nearest-neighbor classifier are all trained and evaluated with the exact same four moves. Learn the moves once and every new model is just a one-line substitution.
The four moves, in order
Here is the entire workflow as a picture. Read it top to bottom: data comes in, gets split, the larger part teaches the model, the held-back part grades it, and finally the finished model labels brand-new flowers.
Notice the two arrows leaving the trained model. One goes to evaluation (the test set, to find out how good the model is) and one goes to use (genuinely new inputs, where there is no answer key). These are different activities and it is worth keeping them straight: you score on the test set, and you predict on the real world.
Move 1: load the data
We will use load_iris, a dataset of 150 iris flowers. Each flower has
four measurements — sepal length, sepal width, petal length, petal width —
and belongs to one of three species: setosa, versicolor, or
virginica. The features are X (150 rows, 4 columns) and the species
labels are y (150 values, each 0, 1, or 2).
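A minimal sketch of the loading step, assuming scikit-learn is installed. It uses the `return_X_y=True` form of `load_iris` so the features and labels come back as two arrays, then prints their shapes and the first flower.

```python
from sklearn.datasets import load_iris

# Load features (X) and integer species labels (y) as NumPy arrays.
X, y = load_iris(return_X_y=True)

print(X.shape)  # (150, 4): 150 flowers, 4 measurements each
print(y.shape)  # (150,): one label per flower
print(X[0])     # the first flower's four measurements
print(y[0])     # the first flower's species code
```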
The four numbers in X[0] are that flower's measurements in centimeters,
and the 0 in y[0] is its species code. The model's job is to learn the
connection between those four numbers and that code, well enough to label a
flower it has never measured.
The numbers behind the names
scikit-learn stores labels as integers (0, 1, 2), not the species strings,
because models do arithmetic on numbers, not words. The dataset keeps the
human-readable names in load_iris().target_names, and we will use them
later to print results people can actually read.
Let us peek at the names so the integer codes mean something.
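One way to look at the names and the class balance is sketched below. Counting with `np.bincount` is a choice made for this illustration; any counting approach would do.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# target_names maps each integer code to its species name.
print(iris.target_names)

# Count how many flowers carry each integer code.
counts = np.bincount(iris.target)
for code, (name, count) in enumerate(zip(iris.target_names, counts)):
    print(code, name, count)
```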
Fifty of each species, four measurements each. A balanced, tidy dataset — which is exactly why it is the classic first example. Real data is rarely this clean, and later pages deal with the mess. For now, a clean dataset lets us focus on the workflow itself.
Move 2: split into train and test
We hold back a slice of flowers so we can grade the model on examples it
never studied. We will put 25% in the test set, fix random_state so the
split is reproducible, and stratify on y so all three species stay
proportionally represented in both the training and the test portion.
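The split described above might look like the following sketch. The value `random_state=42` is an assumption made for illustration; any fixed integer gives a reproducible split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold back 25% of the flowers; stratify=y keeps the species proportions
# the same in both parts. random_state=42 is an arbitrary fixed seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (112, 4) (38, 4)
```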
From here on, X_test and y_test go in a drawer. The model will not see
them during training. We will only unlock them once — at the end — to get a
single honest score.
Split before you touch the data
The split is the first thing you do, before any scaling, cleaning, or feature selection that learns from the numbers. If you compute something from the whole dataset and then split, information from the test set has already bled into training. We are using raw measurements here so there is nothing to leak, but keep the habit: split first, learn second.
Move 3: choose an estimator and fit it
Now we pick a model. In scikit-learn, every model is an estimator — an
object with a .fit() method that learns from data and a .predict()
method that applies what it learned. We will use
KNeighborsClassifier, one of the most intuitive classifiers there is.
Its idea, in one sentence: to label a new flower, find the few training
flowers most similar to it and let them vote. "Similar" means closest in
measurement-space. With n_neighbors=5, each new flower is labeled by the
majority species among its five nearest training neighbors. There is no
heavy math — the model essentially remembers the training flowers and
compares.
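In code, training could be sketched as below. The block repeats the load and split steps so it runs on its own; the seed `random_state=42` is again an assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y  # seed is an assumption
)

# Construct the estimator, then fit it on the training data only.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

print(model.classes_)  # the label codes the model saw during training
```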
That is the whole of training: construct the estimator, then call .fit()
on the training features and labels. The model has now "learned" — for
KNN that simply means it has stored the training flowers in a way it can
search quickly.
The fit / predict contract
Almost every scikit-learn estimator follows the same two-method contract:
.fit(X, y) learns from labeled data, and .predict(X) produces labels for
new data. Swap KNeighborsClassifier for LogisticRegression or
RandomForestClassifier and these two lines do not change. That uniformity
is the whole reason scikit-learn is pleasant to use.
If you would rather use a different estimator, the change is a single line.
Here is the identical workflow with LogisticRegression — a linear model
you will study in depth on its own page — to drive home how interchangeable
estimators are.
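A sketch of that substitution follows. The `max_iter=1000` argument is an assumption added here so the solver has room to converge on unscaled data; it is not part of the workflow itself, and the seed is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y  # seed is an assumption
)

# The only line that differs from the KNN version is the constructor.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```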
Same four moves, a different model, and the rest of your code is untouched. That is the payoff of memorizing the shape.
Move 4: predict and score
A trained model is useful in two distinct ways, and the welcome diagram already hinted at both.
Predict asks the model to label specific inputs. Let us hand it the first eight test flowers and compare its guesses to the truth.
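A sketch of that comparison, taking eight test flowers and printing the model's guesses next to the true labels. The split seed is an assumption, so the exact flowers shown depend on it.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y  # seed is an assumption
)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# predict takes features only and returns one label per input row.
predictions = model.predict(X_test[:8])

print("predicted:", predictions)
print("actual:   ", y_test[:8])
```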
Most or all of those eight will match. But eight flowers is far too few to judge a model — you could get lucky or unlucky. For an honest, stable number we score on the entire test set at once.
Score asks the model how often it is right across all held-out
examples. For a classifier, .score() returns accuracy: the fraction
of test flowers labeled correctly.
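The sketch below computes the accuracy two ways: once with `.score()` and once by hand, predicting every test flower and taking the fraction that match. The split seed is an assumption, so the exact value may differ from the one quoted on this page.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y  # seed is an assumption
)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Way 1: let the estimator grade itself on the held-out set.
score = model.score(X_test, y_test)

# Way 2: predict every test flower, then take the fraction that are correct.
manual_accuracy = np.mean(model.predict(X_test) == y_test)

print(score)
print(manual_accuracy)
```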
The two numbers are identical, because .score() for a classifier is
exactly "predict everything, then take the fraction correct." Around 97% on
unseen flowers — a genuinely trustworthy result, because not one of those
test flowers influenced training.
What .score() means depends on the task
For a classifier, .score() returns accuracy (fraction correct). For a
regressor — a model predicting a number — .score() returns R²
instead, a different quantity entirely. Same method name, different meaning,
because "how good is this?" means different things for categories versus
numbers. The Regression Metrics page and the course's
classification-evaluation material unpack what these scores do and do not
tell you. For now, treat accuracy as a first, rough headline.
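To see the regressor case concretely, here is a small sketch on synthetic data. The data, the seed, and the choice of `LinearRegression` are all assumptions made for illustration. It checks that `.score()` on a regressor matches `r2_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data: y is roughly 3*x + 2 plus noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
regressor = LinearRegression().fit(X_train, y_train)

# For a regressor, .score() is R squared, not accuracy.
score = regressor.score(X_test, y_test)
manual_r2 = r2_score(y_test, regressor.predict(X_test))

print(score)
print(manual_r2)
```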
Predicting on a brand-new flower
In the real world you do not have an answer key — that is the entire point of a model. You measure a flower in the field, hand the four numbers to the trained model, and it tells you the species. Let us do exactly that.
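A sketch of predicting one new flower. The measurements are example values, and the split seed is an assumption as before.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y  # seed is an assumption
)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# One flower, written as one row of four features.
new_flower = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(new_flower)

# prediction is an array with one entry; look up its readable name.
species = iris.target_names[prediction[0]]
print(prediction[0], species)
```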
Two details here are easy to trip over, and both are worth burning in.
First, the input is a list of lists: [[...]], not [...].
scikit-learn always expects a 2D array of shape (n_samples, n_features),
even for a single sample. One flower is "one row of four features," so it
is [[5.1, 3.5, 1.4, 0.2]]. Hand it a flat list and scikit-learn raises a
ValueError saying it expected a 2D array.
Second, predict returns an array, one prediction per input row, so we
index prediction[0] to read the single answer.
The shape mistake everyone makes once
model.predict([5.1, 3.5, 1.4, 0.2]) fails with a ValueError: that is a single
flat list, and scikit-learn cannot tell whether it means one sample with four
features or four samples with one feature each. Always pass a 2D structure:
[[5.1, 3.5, 1.4, 0.2]] for one sample, or a list of such rows for many.
The rule is "rows are samples, columns are features," even when there is
only one row.
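The sketch below triggers the error on purpose and then shows one way to repair a flat list, using NumPy's `reshape(1, -1)` to turn it into a single row. The variable names and the split seed are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y  # seed is an assumption
)
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

flat = [5.1, 3.5, 1.4, 0.2]

# A flat (1D) input is rejected.
raised = False
try:
    model.predict(flat)
except ValueError as err:
    raised = True
    print("ValueError:", str(err).splitlines()[0])

# reshape(1, -1) means "one row, as many columns as needed".
one_row = np.array(flat).reshape(1, -1)
print(one_row.shape)
print(model.predict(one_row))
```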
Try changing those four numbers and re-running. Large petal measurements
(say [6.5, 3.0, 5.5, 2.0]) will push the prediction toward virginica;
tiny petals lean setosa. You are now using a machine learning model the
way it is meant to be used — feeding it new inputs and trusting its output
because you measured its accuracy first.
The whole workflow on one screen
Everything above, assembled into the canonical shape you will repeat for the rest of the course. This is the template; future pages mostly change which estimator sits on the highlighted line.
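A sketch of the assembled template. The seed and the example flower are assumptions; the comment marks the estimator line that later pages would swap.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Load
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split (seed chosen arbitrarily for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 3. Train  <-- the estimator line: swap this to try another model
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 4. Evaluate on the held-out test set
accuracy = model.score(X_test, y_test)
print("test accuracy:", accuracy)

# 5. Use on a brand-new flower (one row, four features)
new_pred = model.predict([[6.5, 3.0, 5.5, 2.0]])
print("predicted species:", iris.target_names[new_pred[0]])
```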
Six lines of logic, ignoring imports and comments. Load, split, train, evaluate, use. Read it until it feels boring — boring is the goal, because a workflow you do not have to think about frees your attention for the decisions that actually matter.
When this exact recipe is not enough
The four-move workflow is always the backbone, but the simple version above takes some shortcuts that real problems will not allow. It is worth knowing where the shortcuts are so you recognize when a later page is filling a gap.
- The data needed no preparation. Iris measurements are all clean numbers on similar scales. Real datasets have missing values, text categories, and features on wildly different scales. The fix, Pipeline and ColumnTransformer, gets its own page, and crucially it keeps the prep from leaking across the split.
- One split can be noisy. With only 150 flowers, a single test set is a thin sample; a different random_state shifts the score a little. Cross-validation replaces one split with several and reports a steadier estimate. That, too, is a dedicated page.
- We never tuned anything. We picked n_neighbors=5 out of thin air. Choosing such settings properly, without peeking at the test set, is what the tuning pages are about.
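To make the "one split can be noisy" point concrete, the sketch below repeats the same workflow with five different seeds and then runs five-fold cross-validation with `cross_val_score`. The seeds and the fold count are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Same model, same data, five different splits: only random_state changes.
split_scores = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))
print("single-split scores:", [round(s, 3) for s in split_scores])

# Cross-validation: five rounds, each holding out a different fifth of the data.
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("cross-validation mean:", round(cv_scores.mean(), 3))
```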
Never tune against the test set
It is tempting to try n_neighbors=3, check the test score, try 7, check
again, and keep the best. The instant you make choices based on the test
score, that score stops being honest — you have started fitting the test
set by hand. Model selection belongs to a validation set or
cross-validation, never the final test set. We flag it here so the habit
forms early.
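One way to respect that rule is sketched below: candidate values of `n_neighbors` are compared with cross-validation on the training data only, and the test set is opened once at the end. The candidate values, the fold count, and the seed are assumptions for illustration, not a recommended search procedure.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y  # seed is an assumption
)

# Compare candidates using the training data only; the test set stays untouched.
candidate_scores = {}
for k in (3, 5, 7):
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_train, y_train, cv=5
    )
    candidate_scores[k] = fold_scores.mean()

best_k = max(candidate_scores, key=candidate_scores.get)
print("cross-validated scores:", candidate_scores)
print("chosen n_neighbors:", best_k)

# Only now, with the choice made, score once on the test set.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
final_score = final_model.score(X_test, y_test)
print("final test accuracy:", final_score)
```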
None of this changes the shape. Every refinement slots inside load → split → train → evaluate. You are not learning a new process later; you are learning to do each move more carefully.
Common misconceptions
- "Fitting and predicting are the same step." They are deliberately separate. .fit() learns from labeled data once; .predict() is then called as often as you like on new inputs, with no further learning. A deployed model fits during training and predicts forever after.
- "A higher training score means a better model." The training score measures memory, not skill. The number that matters is the test score. A model can ace the training data and flop on new data; that gap is overfitting, covered in its own page.
- ".score() always means accuracy." Only for classifiers. For regressors it returns R². The method name is shared; the meaning is not.
- "predict needs the labels too." No: predict takes only features and returns its guess for the labels. You pass y to .fit() (to learn) and to .score() (to grade), but never to .predict().
- "One sample can be a flat list." scikit-learn always wants 2D input, shape (n_samples, n_features). A single sample is still a row inside a list: [[...]].
Real-world applications
Strip away the flowers and this is the spine of essentially every supervised machine learning system in production:
- An email provider loads millions of labeled messages, splits off a held-out set, trains a spam classifier, evaluates its accuracy, and then predicts spam-or-not on every new email that arrives.
- A bank does the same with loan outcomes to predict default risk; a hospital with labeled scans to flag disease; a streaming service with watch history to predict what you will play next.
The dataset is bigger, the estimator fancier, the preparation more elaborate — but the four moves are identical to what you just ran on 150 flowers. That is why this page matters more than any single algorithm: you have learned the frame that holds all of them.
Your turn
Reproduce the complete workflow on the iris dataset, then store the final accuracy so the tests can check it.
- Load iris with load_iris(return_X_y=True) into X and y.
- Split into train/test with 20% in the test set, random_state=0, and stratified on y. Use the standard four-variable unpacking: X_train, X_test, y_train, y_test.
- Create a KNeighborsClassifier(n_neighbors=5) called model and .fit() it on the training data only.
- Store the model's accuracy on the test set in a variable named accuracy (use model.score(...)).
- Use the trained model to predict the species code of the single new flower [[6.0, 2.7, 5.1, 1.6]] and store the integer result in new_pred.
The hidden tests check the split sizes, that model is fitted, that
accuracy is a sensible high number, and that new_pred is a valid
species code (0, 1, or 2).
Check your understanding
What are the four moves of the end-to-end workflow, in order?
Split, load, evaluate, train
Load the data, split into train/test, train (fit) on the training set, evaluate (score) on the test set
Train, evaluate, load, split
Load, train, split, evaluate
To predict the species of a single new flower with measurements 5.1, 3.5, 1.4, 0.2, what do you pass to model.predict(...)?
A flat list: [5.1, 3.5, 1.4, 0.2]
The four numbers as separate arguments
A 2D structure with one row: [[5.1, 3.5, 1.4, 0.2]]
The training labels y_train as well
For a classifier, what does model.score(X_test, y_test) return?
Accuracy — the fraction of test samples the model labeled correctly
The number of training samples used
The model's internal loss value
The probability that the model is correct on average
You want to try KNeighborsClassifier instead of LogisticRegression in your workflow. How much of the load/split/train/evaluate code has to change?
All of it — each estimator needs a different workflow
The split must be redone with a different random_state
Essentially one line — the line that constructs the estimator; .fit(), .score(), and .predict() are called the same way
The data must be reloaded in a different format
A classmate calls model.predict(X_test) and is confused that they did not have to pass y_test. What is the right explanation?
predict always needs the labels; their code has a bug
predict uses y_test internally even though it is not written
predict takes only features and returns the model's guessed labels; you pass y to .fit() to learn and to .score() to grade, but never to .predict()
predict and score are the same method