The scikit-learn API
scikit-learn's quiet superpower is consistency. Every model — linear regression, nearest neighbors, k-means, and hundreds more — wears the same interface. Learn fit, predict, transform, score, and predict_proba once, and you know how to drive them all.
You have now trained several models without us ever dwelling on how you
called them. That was deliberate, and it reveals scikit-learn's best-kept
secret: you have been using the same handful of methods the entire time.
fit to train. predict to get answers. score to evaluate. Whether the
model was a k-nearest-neighbors classifier, a logistic regression, or a
k-means clusterer, the calls looked nearly identical.
This is not an accident — it is the single design decision that makes scikit-learn a joy to use. There are hundreds of algorithms in the library, and they all speak the same small vocabulary. Learn that vocabulary once and you can pick up any model in the ecosystem without learning a new interface. This page is about that vocabulary: the estimator API.
Why a consistent API is a superpower
Imagine if every model came with its own bespoke method names — one wanted
train(), another learn(), a third optimize(). You would spend your life
reading documentation just to call things. scikit-learn refuses that chaos.
Every estimator follows the same contract, so the knowledge you build on
your first model transfers, unchanged, to your thousandth. The API is the
thing you actually learn once and reuse forever.
The core vocabulary
Almost everything in scikit-learn is an estimator — an object that learns something from data. Estimators share a tiny set of methods, and you only need a few to be productive:
- .fit(X, y): learn from data. This is training. Supervised estimators take features and a target; unsupervised ones take just X. After fit, the estimator holds what it learned (its learned parameters).
- .predict(X): produce answers for new data. Used by models that output a prediction per row: regressors return numbers, classifiers return class labels, and clusterers that support it (such as k-means) return group ids.
- .transform(X): produce a transformed version of the data. Used by preprocessing steps (scalers, encoders) that reshape or rescale features rather than predict an answer.
- .score(X, y): report a quick quality number. A built-in default metric: accuracy for classifiers, R² for regressors. Convenient for a fast check, though the metrics chapters will give you sharper tools.
- .predict_proba(X): for many classifiers, the predicted probability of each class, not just the single hard label.
The rhythm is almost always the same two beats: fit, then use. You fit
on training data, then predict (or transform, or score) on new data.
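To make "the estimator holds what it learned" concrete, here is a minimal sketch on a made-up toy dataset (the numbers are invented for illustration, chosen so that y = 2x + 1 exactly). scikit-learn stores learned values in attributes whose names end with an underscore, such as coef_ and intercept_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data invented for this sketch: y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X, y)                      # beat one: fit

# What the model learned lives in trailing-underscore attributes.
print(model.coef_)                   # slope, approximately [2.]
print(model.intercept_)              # offset, approximately 1.0

new_pred = model.predict(np.array([[4.0]]))   # beat two: use
print(new_pred)                      # approximately [9.]
```

Before fit is called these attributes do not exist yet, which is one quick way to tell a fitted estimator from an unfitted one.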
The lifecycle never changes
Every model you will ever meet in this course follows the same two-step lifecycle: fit once on training data, then call predict (or transform, or score) as often as you like on new data. Internalize this rhythm and new models stop being intimidating — you already know how to drive them.
Proof: the same shape across three very different models
Talk is cheap. Let us prove the claim by running three genuinely different algorithms — a regressor, a classifier, and a clusterer — and watching the calls line up. These models work in completely different ways internally, yet from the outside they are nearly indistinguishable.
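The following is a sketch of that comparison, not the page's original cell. It assumes the bundled iris dataset and three arbitrarily chosen models (LinearRegression, KNeighborsClassifier, KMeans); the regression target (petal width predicted from the other three columns) is also just a convenient choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regressor: predict petal width (last column) from the other three columns.
regressor = LinearRegression()
regressor.fit(X_train[:, :3], X_train[:, 3])
numbers = regressor.predict(X_test[:, :3])      # continuous values

# Classifier: predict the species label.
classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)
labels = classifier.predict(X_test)             # class labels 0, 1, 2

# Clusterer: no target at all, only X.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
clusterer.fit(X_train)
groups = clusterer.predict(X_test)              # cluster ids 0, 1, 2

print(numbers[:3], labels[:3], groups[:3])
```

Note that the cluster ids are arbitrary group numbers; they are not aligned with the species labels even when the groupings happen to agree.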
Read that again and let it land. Three models with nothing in common under
the hood — one fits a line, one memorizes neighbors, one finds centroids — and
yet the code is the same two methods each time: fit, then predict. The
only difference is what comes out (a number, a class, a group). This
uniformity is exactly why, once you have trained one scikit-learn model, you
have effectively trained them all.
What this buys you
Because the interface is uniform, swapping one model for another is often a
one-line change — replace KNeighborsClassifier() with
LogisticRegression() and the rest of your code is untouched. This makes
experimenting with different algorithms almost free, which is a huge part of
why scikit-learn became the standard for classical machine learning.
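A minimal sketch of that swap, assuming the bundled iris dataset. The helper train_and_score is a hypothetical name introduced here for illustration; it is not part of scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def train_and_score(model):
    """Hypothetical helper: works for any classifier with fit and score."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Only the constructor changes; the surrounding code is identical.
knn_accuracy = train_and_score(KNeighborsClassifier())
logreg_accuracy = train_and_score(LogisticRegression(max_iter=1000))

print(knn_accuracy, logreg_accuracy)
```

The helper never mentions a specific algorithm, which is the point: it depends only on the shared interface.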
A quick check
Why is it often a one-line change to swap a KNeighborsClassifier for a LogisticRegression in scikit-learn?
Because the two algorithms work the same way internally
Because scikit-learn automatically rewrites your code
Because all estimators share the same interface (fit, predict, score), so the surrounding code does not need to change when you swap the model
Because both models always give identical predictions
Estimators vs. transformers
There is one important distinction inside the estimator family. Some estimators predict an answer; others transform the data. The method they expose tells you which kind you are holding.
- A predictor (regressor, classifier, clusterer) has a .predict(). You give it features, it gives you an answer: a number, a label, a group.
- A transformer (scaler, encoder, dimensionality reducer) has a .transform(). You give it features, it gives you reshaped features (scaled, encoded, or compressed), ready to feed into a model.
Both are estimators, and both fit first: a scaler must learn the columns'
means and spreads before it can rescale, just as a model must learn before it
can predict. The difference is purely what they produce afterward. Let us see
a transformer — StandardScaler, which rescales each feature to have mean 0
and unit variance — and confirm it exposes transform, not predict.
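A small sketch of that check, using a tiny made-up array (the values are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy features invented for this sketch: two columns on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

scaler = StandardScaler()
scaler.fit(X_train)                          # learns each column's mean and spread
X_train_scaled = scaler.transform(X_train)   # returns rescaled data, not answers

print(scaler.mean_)                          # per-column means learned during fit
print(X_train_scaled.mean(axis=0))           # approximately 0 for each column
print(X_train_scaled.std(axis=0))            # approximately 1 for each column

print(hasattr(scaler, "transform"))          # True
print(hasattr(scaler, "predict"))            # False
```

The output has the same shape as the input: same rows, same columns, different scale.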
A transformer returns data; a predictor returns answers. That is the
whole distinction. Transformers shine when you need to prepare features
before modeling — and because they share the fit/transform interface, they
chain together cleanly with models inside a Pipeline, which the
pipelines and ColumnTransformer chapter is devoted to.
fit_transform is a convenience, not magic
You will often see scaler.fit_transform(X). It simply does fit then
transform in one call — handy, but identical to doing them separately. One
caution: only ever fit (or fit_transform) a transformer on your
training data, then transform the test data with that already-fitted
transformer. Fitting on the test data leaks information across the split — the
exact leak the train/test page warned about. Pipelines exist largely to make
this correct by default.
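As a sketch of the safe pattern, assuming the bundled wine dataset and a default train/test split: fit (or fit_transform) on the training rows only, then reuse the fitted scaler on the test rows.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit + transform, training data only
X_test_scaled = scaler.transform(X_test)         # transform only: no refitting

# fit_transform gives the same result as calling fit, then transform.
separate = StandardScaler().fit(X_train).transform(X_train)
same_result = np.allclose(X_train_scaled, separate)
print(same_result)
```

Because the test rows were scaled with the training means and spreads, their scaled columns will generally not have mean exactly 0. That is expected and correct.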
score: a quick quality number
Every classifier and regressor offers a .score(X, y) that returns a single default
metric, so you can sanity-check a model in one line. The default differs by
task: classifiers return accuracy (fraction correct), regressors return
R² (the fraction of the target's variance the model explains, where 1.0 is
perfect).
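A sketch showing both defaults side by side, assuming the bundled wine (classification) and diabetes (regression) datasets. It also confirms that score matches the explicit accuracy_score and r2_score functions from sklearn.metrics.

```python
from sklearn.datasets import load_diabetes, load_wine
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Classifier: score is accuracy.
Xc, yc = load_wine(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, random_state=0)
clf = KNeighborsClassifier().fit(Xc_train, yc_train)
clf_score = clf.score(Xc_test, yc_test)
manual_accuracy = accuracy_score(yc_test, clf.predict(Xc_test))

# Regressor: score is R squared.
Xr, yr = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)
reg_score = reg.score(Xr_test, yr_test)
manual_r2 = r2_score(yr_test, reg.predict(Xr_test))

print(clf_score, manual_accuracy)
print(reg_score, manual_r2)
```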
score is perfect for a fast gut check, but it is intentionally simple — one
number, one default metric. Real evaluation often needs more nuance
(precision and recall for imbalanced classes, mean absolute error for
regression, and so on). The classification metrics and regression metrics
chapters unpack when score's default is enough and when it quietly misleads.
Always score on held-out data
score does not care what data you hand it — it will happily report a
flattering number on the training set. Calling model.score(X_train, y_train) measures memory, not skill. To get an honest reading, score on data
the model never trained on, exactly as the train/test page insisted.
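To see the gap, here is a sketch assuming the bundled wine dataset and a 1-nearest-neighbor classifier, chosen because it memorizes the training set almost by definition: each training row is its own nearest neighbor.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)   # flattering: the model has seen these rows
test_score = model.score(X_test, y_test)      # honest: rows the model never trained on

print(train_score, test_score)
```

The training score is the number to distrust; the test score is the one to report.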
predict_proba: probabilities, not just labels
A hard label — "spam" — throws away useful information. How sure is the
model? Many classifiers can tell you, via .predict_proba(X), which
returns the estimated probability of each class. The probabilities across the
classes sum to 1 for each row.
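A minimal sketch, assuming the bundled iris dataset and a LogisticRegression classifier (any classifier that exposes predict_proba would do):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_test)     # shape: (n_rows, n_classes)
hard_labels = clf.predict(X_test)

print(clf.classes_)                   # column order of probs
print(probs[:3].round(3))             # one probability per class, per row
print(probs.sum(axis=1)[:3])          # each row sums to 1 (up to rounding)

# For this model, the hard label is the class with the highest probability.
most_likely = clf.classes_[probs.argmax(axis=1)]
print(np.array_equal(most_likely, hard_labels))
```

The columns of the returned array follow the order of clf.classes_, which matters when the labels are not simply 0, 1, 2.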
Those probabilities are the raw material for so much that follows: setting a
custom decision threshold (flag fraud only above 80% confidence), ranking
predictions by confidence, and drawing ROC curves. The ROC curves and AUC
chapter builds directly on predict_proba. Note that not every model exposes
it — regressors do not (their output is already a number), and a few
classifiers lack it — but where present, it is invaluable.
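The custom-threshold idea could look roughly like this. The sketch assumes a synthetic binary dataset from make_classification standing in for a fraud problem, with class 1 playing the role of "fraud"; the 0.8 cutoff mirrors the 80% example in the text.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; class 1 is treated as the "flag" class.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 holds the probability of class 1 (order follows clf.classes_).
p_positive = clf.predict_proba(X_test)[:, 1]

default_flags = clf.predict(X_test)               # roughly a 0.5 cutoff
strict_flags = (p_positive >= 0.8).astype(int)    # flag only when quite sure

print(default_flags.sum(), strict_flags.sum())
```

Raising the threshold flags fewer rows, trading missed positives for fewer false alarms. Which trade is right depends on the problem, not on the model.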
A probability is a confidence, not a guarantee
predict_proba reports the model's estimated confidence given what it
learned — it is not a cosmic truth. A model can be confidently wrong,
especially on data unlike anything it trained on. Treat these numbers as a
useful signal to reason with, not as certified probabilities. Calibrating
them so they mean what they claim is its own subtle topic.
Putting the whole API together
One more example to see the full vocabulary working in concert: fit a classifier, get hard predictions, get probabilities, and score — all on the same model, all with the standard methods.
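A sketch of the four calls together, assuming the bundled iris dataset and a KNeighborsClassifier (both are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)

model.fit(X_train, y_train)                     # 1. train
predictions = model.predict(X_test)             # 2. hard labels
probabilities = model.predict_proba(X_test)     # 3. per-class probabilities
accuracy = model.score(X_test, y_test)          # 4. default metric (accuracy)

print(predictions[:5])
print(probabilities[:5])
print(accuracy)

# score agrees with computing accuracy by hand from the predictions.
print(np.isclose(accuracy, (predictions == y_test).mean()))
```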
Four method calls, and you have trained a model, made predictions, gauged its
confidence, and measured its quality. Every one of those calls — fit,
predict, predict_proba, score — looks the same no matter which
estimator you swap in. That is the API you learn once and use forever.
Common misconceptions
- "Each model needs its own special method names." The opposite is the whole point. fit, predict, transform, score are shared across the entire library; that consistency is scikit-learn's defining feature.
- "fit returns the predictions." No. fit trains the estimator and returns the estimator itself (so calls can be chained). Predictions come from a separate predict call afterward.
- "score is a thorough evaluation." It is a single convenience metric with a fixed default. Serious evaluation usually needs the richer metrics in the evaluation chapters; score is a quick check, not the last word.
- "predict_proba gives true, calibrated probabilities." It gives the model's estimated confidence, which can be miscalibrated. Useful, but not gospel.
- "Transformers and predictors are unrelated." Both are estimators and both fit. They differ only in what they produce afterward: reshaped data (transform) versus answers (predict).
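The point about what fit returns is easy to verify. A short sketch, assuming the bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

model = KNeighborsClassifier()
returned = model.fit(X, y)

print(returned is model)          # True: fit hands back the same estimator object

# Because fit returns the estimator, calls can be chained on one line.
chained_predictions = KNeighborsClassifier().fit(X, y).predict(X[:3])
print(chained_predictions)
```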
Real-world applications
The consistent API is not an academic nicety; it is what makes real machine learning workflows tractable. Because every estimator shares the interface:
- Model comparison is trivial. You can loop over a list of models, calling the same fit/score on each, and rank them, without special-casing any algorithm. This is the backbone of model selection.
- Pipelines just work. Transformers (fit/transform) and a final model (fit/predict) snap together into one object because they share the contract. The pipelines chapter relies entirely on this.
- Tuning is automated. Tools like GridSearchCV can tune any estimator through the same interface, which is why hyperparameter search in scikit-learn is a few lines regardless of the model.
- Knowledge compounds. Every new algorithm you learn (trees, forests, gradient boosting) arrives already speaking fit/predict. You learn its ideas, never a new way to call it.
That leverage — learn the interface once, apply it everywhere — is why this page sits in the foundations. The rest of the course will introduce many models, but you already know how to operate every single one of them.
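The comparison, pipeline, and tuning points can be sketched in one short script. This assumes the bundled wine dataset; the candidate models and the grid of n_neighbors values are arbitrary choices for illustration, and later chapters cover pipelines and tuning properly.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Pipelines: a transformer and a final model behave as one estimator.
candidates = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Model comparison: the same fit/score calls, whatever the algorithm.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
print(scores)

# Tuning: GridSearchCV is itself an estimator with fit and score.
# make_pipeline names each step after its lowercased class name.
search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": [1, 3, 5, 7]},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```

Nothing in the loop or the search refers to a specific algorithm's internals; every step goes through fit, predict, or score.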
Your turn
The challenge asks you to drive a given estimator through the standard
lifecycle: fit it, predict with it, and score it. The specific model is
provided; the point is that the interface is the thing you have learned, and
it works the same regardless of which estimator sits behind it.
A classifier object model (a KNeighborsClassifier) and a
train/test split (X_train, X_test, y_train, y_test from the wine dataset)
are already provided. Drive the model through the standard scikit-learn API.
- Train the model on the training data using .fit(...).
- Use .predict(...) on X_test and store the resulting array in predictions.
- Use .predict_proba(...) on X_test and store the result in probs.
- Use .score(...) on the test set and store the number in accuracy.
The hidden tests check that the model is fitted, that predictions has one
label per test row, that each row of probs sums to 1, and that
accuracy matches scoring on the test set.
Check your understanding
What is scikit-learn's "consistent API" and why does it matter?
A rule that every model must achieve the same accuracy
A requirement that all data be in NumPy arrays
The fact that nearly every estimator shares the same methods (fit, predict, transform, score), so learning to use one model teaches you to use them all
A guarantee that models never make mistakes
What does .fit(X, y) do, and what does it return?
It returns the predictions for X
It returns the accuracy score on X and y
It trains the estimator (learning from the data) and returns the estimator itself; predictions come later from a separate .predict() call
It splits the data into training and test sets
How do you tell a transformer (like StandardScaler) from a predictor (like LinearRegression) by their methods?
Transformers have fit; predictors do not
Transformers run faster than predictors
A predictor exposes .predict() and returns answers; a transformer exposes .transform() and returns reshaped data — though both are estimators that fit first
Predictors return data; transformers return predictions
A classifier's .predict() returns the label "spam". What does .predict_proba() return for the same input, and what is it good for?
The same label "spam" again, just formatted differently
The accuracy of the model on that single row
A probability for each class (summing to 1), useful for ranking by confidence, setting custom decision thresholds, and drawing ROC curves
The training data the model memorized
You call model.score(X_train, y_train) and get a very high number. Why is this not a trustworthy measure of the model's quality?
score only works on test data and will raise an error here
A high score always means the model is excellent
score happily reports on whatever data you pass it, and scoring on the training data measures memorization, not the model's ability to generalize to new data
score returns R² for classifiers, which is meaningless
Which statement about predict_proba is most accurate?
Every scikit-learn estimator, including regressors, provides it
The probabilities it returns are always perfectly calibrated truths
It is available on many classifiers (not all, and not on regressors), and its numbers are the model's estimated confidence, which can be miscalibrated
It returns the single most likely class label, like predict