The scikit-learn API

scikit-learn's quiet superpower is consistency. Every model — linear regression, nearest neighbors, k-means, and hundreds more — wears the same interface. Learn fit, predict, transform, score, and predict_proba once, and you know how to drive them all.

You have now trained several models without us ever dwelling on how you called them. That was deliberate, and it reveals scikit-learn's best-kept secret: you have been using the same handful of methods the entire time. fit to train. predict to get answers. score to evaluate. Whether the model was a k-nearest-neighbors classifier, a logistic regression, or a k-means clusterer, the calls looked nearly identical.

This is not an accident — it is the single design decision that makes scikit-learn a joy to use. There are hundreds of algorithms in the library, and they all speak the same small vocabulary. Learn that vocabulary once and you can pick up any model in the ecosystem without learning a new interface. This page is about that vocabulary: the estimator API.

Why a consistent API is a superpower

Imagine if every model came with its own bespoke method names — one wanted train(), another learn(), a third optimize(). You would spend your life reading documentation just to call things. scikit-learn refuses that chaos. Every estimator follows the same contract, so the knowledge you build on your first model transfers, unchanged, to your thousandth. The API is the thing you actually learn once and reuse forever.

The core vocabulary

Almost everything in scikit-learn is an estimator — an object that learns something from data. Estimators share a tiny set of methods, and you only need a few to be productive:

.fit(X, y): learn from data. This is training. Supervised estimators take features and a target; unsupervised ones take just X. After fit, the estimator holds what it learned in attributes whose names end with an underscore (coef_ on a linear model, for example).
.predict(X): produce answers for new data. Used by models that output a prediction per row: regressors return numbers, classifiers return class labels, and clusterers that support it (k-means, for one) return group ids.
.transform(X): produce a transformed version of the data. Used by preprocessing steps (scalers, encoders) that reshape or rescale features rather than predict an answer.
.score(X, y): report a quick quality number. A built-in default metric: accuracy for classifiers, R² for regressors. Convenient for a fast check, though the metrics chapters will give you sharper tools.
.predict_proba(X): for many classifiers, the predicted probability of each class, not just the single hard label.

The rhythm is almost always the same two beats: fit, then use. You fit on training data, then predict (or transform, or score) on new data.

The lifecycle never changes

Every model you will ever meet in this course follows the same two-step lifecycle: fit once on training data, then call predict (or transform, or score) as often as you like on new data. Internalize this rhythm and new models stop being intimidating — you already know how to drive them.

Proof: the same shape across three very different models

Talk is cheap. Let us prove the claim by running three genuinely different algorithms — a regressor, a classifier, and a clusterer — and watching the calls line up. These models work in completely different ways internally, yet from the outside they are nearly indistinguishable.
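
A minimal sketch of that comparison might look like this. The dataset (iris), the choice to predict petal width from the other three measurements, and the variable names are illustrative assumptions rather than the course's original example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # 150 flowers, 4 measurements each

# Regressor: fit a line that predicts petal width (column 3)
# from the other three measurements. The target choice is illustrative.
reg = LinearRegression()
reg.fit(X[:, :3], X[:, 3])
reg_out = reg.predict(X[:5, :3])   # one number per row

# Classifier: label each flower by looking at its nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
clf_out = clf.predict(X[:5])       # one class label per row

# Clusterer: find three centroids without ever seeing y.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(X)
km_out = km.predict(X[:5])         # one cluster id per row

print(reg_out.shape, clf_out.shape, km_out.shape)
```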

Read that again and let it land. Three models with nothing in common under the hood — one fits a line, one memorizes neighbors, one finds centroids — and yet the code is the same two methods each time: fit, then predict. The only difference is what comes out (a number, a class, a group). This uniformity is exactly why, once you have trained one scikit-learn model, you have effectively trained them all.

What this buys you

Because the interface is uniform, swapping one model for another is often a one-line change — replace KNeighborsClassifier() with LogisticRegression() and the rest of your code is untouched. This makes experimenting with different algorithms almost free, which is a huge part of why scikit-learn became the standard for classical machine learning.
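
As a rough illustration of that one-line swap, the sketch here runs two different classifiers through identical code. The helper fit_and_score, the iris dataset, and max_iter=1000 are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)


def fit_and_score(model):
    """Hypothetical helper: the same two calls work for any classifier."""
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Only the constructor changes; everything else is untouched.
knn_accuracy = fit_and_score(KNeighborsClassifier())
logreg_accuracy = fit_and_score(LogisticRegression(max_iter=1000))
print(knn_accuracy, logreg_accuracy)
```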

A quick check

Question (select one)

Why is it often a one-line change to swap a KNeighborsClassifier for a LogisticRegression in scikit-learn?

Because the two algorithms work the same way internally

Because scikit-learn automatically rewrites your code

Because all estimators share the same interface (fit, predict, score), so the surrounding code does not need to change when you swap the model

Because both models always give identical predictions

Estimators vs. transformers

There is one important distinction inside the estimator family. Some estimators predict an answer; others transform the data. The method they expose tells you which kind you are holding.

A predictor (regressor, classifier, clusterer) has a .predict(). You give it features, it gives you an answer — a number, a label, a group.
A transformer (scaler, encoder, dimensionality reducer) has a .transform(). You give it features, it gives you reshaped features — scaled, encoded, or compressed — ready to feed into a model.

Both are estimators, and both fit first: a scaler must learn the columns' means and spreads before it can rescale, just as a model must learn before it can predict. The difference is purely what they produce afterward. Let us see a transformer — StandardScaler, which rescales each feature to have mean 0 and unit variance — and confirm it exposes transform, not predict.
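
One way to check this, sketched on a tiny hand-made array (the numbers are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler()
scaler.fit(X)                    # learns each column's mean and spread
X_scaled = scaler.transform(X)   # returns rescaled data, not answers

print(scaler.mean_)              # the per-column means learned during fit
print(X_scaled.mean(axis=0))     # approximately 0 for each column
print(X_scaled.std(axis=0))      # approximately 1 for each column

# A transformer exposes transform, not predict.
print(hasattr(scaler, "transform"), hasattr(scaler, "predict"))
```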

A transformer returns data; a predictor returns answers. That is the whole distinction. Transformers shine when you need to prepare features before modeling — and because they share the fit/transform interface, they chain together cleanly with models inside a Pipeline, which the pipelines and ColumnTransformer chapter is devoted to.

fit_transform is a convenience, not magic

You will often see scaler.fit_transform(X). It does fit and then transform in one call: handy, and it gives the same result as calling the two separately. One caution: only ever fit (or fit_transform) a transformer on your training data, then transform the test data with that already-fitted transformer. Fitting on the test data leaks information across the split, the exact leak the train/test page warned about. Pipelines exist largely to make this correct by default.
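
A short sketch of the safe pattern, assuming the wine dataset and a default train/test split purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn means and spreads from train only
X_test_scaled = scaler.transform(X_test)        # reuse them; never refit on test

# fit_transform gives the same result as fit followed by transform.
two_step = StandardScaler().fit(X_train).transform(X_train)
print(np.allclose(X_train_scaled, two_step))
```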

score: a quick quality number

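Classifiers and regressors offer a .score(X, y) that returns a single default metric, so you can sanity-check a model in one line. The default differs by task: classifiers return accuracy (the fraction of predictions that are correct), and regressors return R² (the fraction of the target's variance the model explains; 1.0 is perfect, 0.0 is no better than always predicting the mean, and the value goes negative for a model that does worse than that).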
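
To make the two defaults concrete, this sketch compares score with the matching functions from sklearn.metrics. The datasets (iris and diabetes) and the model choices are assumptions for illustration.

```python
from sklearn.datasets import load_diabetes, load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Classifier: score is accuracy.
Xc, yc = load_iris(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, random_state=0)
clf = KNeighborsClassifier().fit(Xc_train, yc_train)
clf_score = clf.score(Xc_test, yc_test)
clf_manual = accuracy_score(yc_test, clf.predict(Xc_test))

# Regressor: score is R².
Xr, yr = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_train, yr_train)
reg_score = reg.score(Xr_test, yr_test)
reg_manual = r2_score(yr_test, reg.predict(Xr_test))

print(clf_score, clf_manual)  # the two numbers agree
print(reg_score, reg_manual)  # and so do these
```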

score is perfect for a fast gut check, but it is intentionally simple — one number, one default metric. Real evaluation often needs more nuance (precision and recall for imbalanced classes, mean absolute error for regression, and so on). The classification metrics and regression metrics chapters unpack when score's default is enough and when it quietly misleads.

Always score on held-out data

score does not care what data you hand it — it will happily report a flattering number on the training set. Calling model.score(X_train, y_train) measures memory, not skill. To get an honest reading, score on data the model never trained on, exactly as the train/test page insisted.
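
A small sketch of how flattering the training score can be. The synthetic dataset and the deliberately extreme choice of a single neighbor are assumptions picked to make the gap obvious.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data where about 20% of labels are assigned at random,
# so no model can be perfect on unseen rows.
X, y = make_classification(
    n_samples=400, n_features=10, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With one neighbor, every training point is its own nearest neighbor,
# so the model simply recalls the label it memorized.
model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_score = model.score(X_train, y_train)  # measures memory
test_score = model.score(X_test, y_test)     # measures skill
print(train_score, test_score)
```

The training score comes out as a perfect 1.0 here by construction, which says nothing about how the model handles rows it has not seen.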

predict_proba: probabilities, not just labels

A hard label — "spam" — throws away useful information. How sure is the model? Many classifiers can tell you, via .predict_proba(X), which returns the estimated probability of each class. The probabilities across the classes sum to 1 for each row.

Those probabilities are the raw material for much of what follows: setting a custom decision threshold (flag fraud only above 80% confidence), ranking predictions by confidence, and drawing ROC curves. The ROC curves and AUC chapter builds directly on predict_proba. Not every model exposes it: regressors do not (their output is already a number), and a few classifiers lack it (LinearSVC, for example, offers decision_function instead). Where it is present, it is invaluable.
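
Here is a hedged sketch of predict_proba and a custom threshold. The breast cancer dataset, the scaler-plus-logistic-regression pipeline, and the 0.8 cutoff are illustrative assumptions; any classifier that exposes predict_proba could stand in.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# One row per sample, one column per class (in the order of model.classes_).
probs = model.predict_proba(X_test)
labels = model.predict(X_test)
print(probs[:3])
print(probs.sum(axis=1)[:3])  # each row sums to 1

# Custom threshold: flag class 1 only when the model is at least 80% sure.
THRESHOLD = 0.8
confident_positive = probs[:, 1] >= THRESHOLD
print(confident_positive.sum(), (labels == 1).sum())
```

Raising the threshold can only shrink the set of flagged rows relative to the default hard labels, which is the trade-off a custom threshold lets you control.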

A probability is a confidence, not a guarantee

predict_proba reports the model's estimated confidence given what it learned — it is not a cosmic truth. A model can be confidently wrong, especially on data unlike anything it trained on. Treat these numbers as a useful signal to reason with, not as certified probabilities. Calibrating them so they mean what they claim is its own subtle topic.

Putting the whole API together

One more example to see the full vocabulary working in concert: fit a classifier, get hard predictions, get probabilities, and score — all on the same model, all with the standard methods.
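
A sketch of the full vocabulary on one model, assuming the iris dataset and a LogisticRegression with max_iter=1000 (both chosen only for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                   # 1. train
predictions = model.predict(X_test)           # 2. hard labels
probabilities = model.predict_proba(X_test)   # 3. confidence per class
accuracy = model.score(X_test, y_test)        # 4. quick quality number

print(predictions[:5])
print(np.round(probabilities[:5], 3))
print(accuracy)
```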

Four method calls, and you have trained a model, made predictions, gauged its confidence, and measured its quality. Every one of those calls — fit, predict, predict_proba, score — looks the same no matter which estimator you swap in. That is the API you learn once and use forever.

Common misconceptions

"Each model needs its own special method names." The opposite is the whole point. fit, predict, transform, score are shared across the entire library — that consistency is scikit-learn's defining feature.
"fit returns the predictions." No. fit trains the estimator and returns the estimator itself (so calls can be chained). Predictions come from a separate predict call afterward.
"score is a thorough evaluation." It is a single convenience metric with a fixed default. Serious evaluation usually needs the richer metrics in the evaluation chapters; score is a quick check, not the last word.
"predict_proba gives true, calibrated probabilities." It gives the model's estimated confidence, which can be miscalibrated. Useful, but not gospel.
"Transformers and predictors are unrelated." Both are estimators and both fit. They differ only in what they produce afterward — reshaped data (transform) versus answers (predict).

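The second misconception is easy to test directly. This sketch uses a tiny invented dataset where y is exactly 2x, so the learned values are easy to read.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])  # exactly y = 2x

model = LinearRegression()
returned = model.fit(X, y)
print(returned is model)  # fit hands back the estimator, not predictions

# What was learned lives on the estimator, in attributes ending with "_".
print(model.coef_, model.intercept_)

# Because fit returns the estimator, the calls chain.
chained = LinearRegression().fit(X, y).predict(np.array([[4.0]]))
print(chained)
```
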
Real-world applications

The consistent API is not an academic nicety; it is what makes real machine learning workflows tractable. Because every estimator shares the interface:

Model comparison is trivial. You can loop over a list of models, calling the same fit/score on each, and rank them — without special-casing any algorithm. This is the backbone of model selection.
Pipelines just work. Transformers (fit/transform) and a final model (fit/predict) snap together into one object because they share the contract. The pipelines chapter relies entirely on this.
Tuning is automated. Tools like GridSearchCV can tune any estimator through the same interface, which is why hyperparameter search in scikit-learn is a few lines regardless of the model.
Knowledge compounds. Every new algorithm you learn — trees, forests, gradient boosting — arrives already speaking fit/predict. You learn its ideas, never a new way to call it.
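
The first three points can be sketched in a few lines. The candidate models, the wine dataset, and the n_neighbors grid are assumptions chosen for illustration; the step name kneighborsclassifier is the one make_pipeline generates from the class name.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Model comparison: the same fit/score calls for every candidate.
# Pipelines (transformer + model) are themselves estimators.
candidates = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "tree": DecisionTreeClassifier(random_state=0),
}
results = {
    name: estimator.fit(X_train, y_train).score(X_test, y_test)
    for name, estimator in candidates.items()
}
print(results)

# Tuning: GridSearchCV drives any estimator through the same interface.
search = GridSearchCV(
    make_pipeline(StandardScaler(), KNeighborsClassifier()),
    param_grid={"kneighborsclassifier__n_neighbors": [1, 3, 5, 7]},
    cv=5,
)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))
```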

That leverage — learn the interface once, apply it everywhere — is why this page sits in the foundations. The rest of the course will introduce many models, but you already know how to operate every single one of them.

Your turn

The challenge asks you to drive a given estimator through the standard lifecycle: fit it, predict with it, and score it. The specific model is provided; the point is that the interface is the thing you have learned, and it works the same regardless of which estimator sits behind it.

A classifier object named model (a KNeighborsClassifier) and a train/test split (X_train, X_test, y_train, y_test from the wine dataset) are already provided. Drive the model through the standard scikit-learn API.

Train the model on the training data using .fit(...).
Use .predict(...) on X_test and store the resulting array in predictions.
Use .predict_proba(...) on X_test and store the result in probs.
Use .score(...) on the test set and store the number in accuracy.

The hidden tests check that the model is fitted, that predictions has one label per test row, that each row of probs sums to 1, and that accuracy matches scoring on the test set.

Check your understanding

Question (select one)

What is scikit-learn's "consistent API" and why does it matter?

A rule that every model must achieve the same accuracy

A requirement that all data be in NumPy arrays

The fact that nearly every estimator shares the same methods (fit, predict, transform, score), so learning to use one model teaches you to use them all

A guarantee that models never make mistakes

Question (select one)

What does .fit(X, y) do, and what does it return?

It returns the predictions for X

It returns the accuracy score on X and y

It trains the estimator (learning from the data) and returns the estimator itself; predictions come later from a separate .predict() call

It splits the data into training and test sets

Question (select one)

How do you tell a transformer (like StandardScaler) from a predictor (like LinearRegression) by their methods?

Transformers have fit; predictors do not

Transformers run faster than predictors

A predictor exposes .predict() and returns answers; a transformer exposes .transform() and returns reshaped data — though both are estimators that fit first

Predictors return data; transformers return predictions

Question (select one)

A classifier's .predict() returns the label "spam". What does .predict_proba() return for the same input, and what is it good for?

The same label "spam" again, just formatted differently

The accuracy of the model on that single row

A probability for each class (summing to 1), useful for ranking by confidence, setting custom decision thresholds, and drawing ROC curves

The training data the model memorized

Question (select one)

You call model.score(X_train, y_train) and get a very high number. Why is this not a trustworthy measure of the model's quality?

score only works on test data and will raise an error here

A high score always means the model is excellent

score happily reports on whatever data you pass it, and scoring on the training data measures memorization, not the model's ability to generalize to new data

score returns R² for classifiers, which is meaningless

Question (select one)

Which statement about predict_proba is most accurate?

Every scikit-learn estimator, including regressors, provides it

The probabilities it returns are always perfectly calibrated truths

It is available on many classifiers (not all, and not on regressors), and its numbers are the model's estimated confidence, which can be miscalibrated

It returns the single most likely class label, like predict
