K-Nearest Neighbors

The laziest algorithm in machine learning — it does no real training at all. To predict, it simply looks at the closest examples it has seen and takes a vote. Simple, intuitive, and a perfect lens for understanding distance and feature scaling.

Most models work hard up front. A linear regression solves for coefficients; a decision tree searches for the best splits. By the time training finishes, the original data can be thrown away — the model is the learned formula or flowchart.

K-nearest neighbors (KNN) does the opposite, and it is almost comically simple. To predict the label of a new point, it asks one question: which training examples are closest to you? It finds the k nearest ones and lets them vote. For classification, the majority class wins. For regression, it averages their values. That is the entire algorithm. There is no equation to fit, no tree to grow — KNN just memorizes the training set and consults it at prediction time.

This makes KNN the canonical example of instance-based or lazy learning, and it is one of the best teaching tools in all of machine learning. It forces you to confront two ideas you will use forever: what it means for examples to be "close," and why feature scaling can make or break a distance-based model.

The intuition: you are who you sit next to

The premise is something you already believe about the world. If you want to guess whether a fruit is an apple or an orange, and the three fruits most similar to it in size, color, and weight are all apples, you would guess apple. If you want to estimate a house's price and the five most similar houses nearby sold for around 400,000, you would guess about 400,000. KNN formalizes exactly this instinct: similar inputs tend to have similar outputs.

The diagram shows a single prediction. The new point compares itself to every stored example, keeps the five nearest, and reports the majority label among them. Change k and you change how many neighbors get a say.

Why 'lazy' is a technical term, not an insult

KNN is called a lazy learner because it defers all the work to prediction time. fit does almost nothing — it just stores the data (and builds an index to find neighbors quickly). Compare this to a lazy student who crams during the exam rather than studying beforehand. The cost is not free: a lazy learner can be slow and memory-hungry when you actually ask it to predict, especially on large datasets.

What does "closest" mean? Distance

"Nearest" requires a notion of distance. By default KNN uses Euclidean distance — the straight-line distance you would measure with a ruler, the Pythagorean theorem extended to as many dimensions as you have features. Two examples are close if the sum of their squared differences across all features is small.

This is intuitive, but it hides a trap that we will spend most of this page on. Euclidean distance adds up differences across features as if every feature were measured in the same units. It almost never is — and that is where scaling comes in.

Your first KNN classifier

Let us classify iris flowers. The API is, as always, identical to every other scikit-learn estimator.

Notice that fit returns instantly: there is nothing to optimize. All the work happens inside score (and predict), where each test flower searches the training set for its five nearest neighbors and takes their majority vote. On the well-separated iris data, this naive strategy is remarkably accurate.

The dial that controls everything: choosing `k`

k (the argument n_neighbors) is KNN's single most important knob, and it directly controls the bias-variance tradeoff you met earlier.

Small k (e.g. 1). Each prediction is decided by a single closest neighbor. The decision boundary becomes jagged and wraps tightly around individual points — including noisy ones. This is low bias, high variance: the model is sensitive to every quirk in the data and tends to overfit.
Large k (e.g. 50). Each prediction is an average over many neighbors, which smooths the boundary. Too large, and predictions blur toward the overall majority class, ignoring local structure. This is high bias, low variance: the model underfits.

Let us watch test accuracy respond to k.

Look at k=1: the training accuracy is a perfect 1.000. That is not skill — with one neighbor, the closest point to any training example is itself, so it always votes correctly. The test score is the honest one, and it is lower. As k grows, training accuracy comes down to earth while test accuracy rises to a plateau, then eventually falls as the model over-smooths. The best k lives in the middle, and you find it properly with cross-validation (its own chapter), not by peeking at the test set.

A useful default and an odd-number trick

A common starting point is k around the square root of the number of training samples, then refine with cross-validation. For binary classification, prefer an odd k so a two-class vote cannot tie. Ties are broken arbitrarily otherwise, which adds noise you do not want.

The make-or-break issue: feature scaling

Here is the single most important thing to know about KNN, and the reason it is taught so early: KNN is distance-based, so the scale of your features matters enormously. Euclidean distance sums squared differences across features. If one feature ranges over 0–1000 and another over 0–1, the large-range feature utterly dominates the distance, and the small one is ignored — no matter how informative it actually is.

Let us prove it. We will build a dataset with one genuinely useful feature measured in small units and one pure-noise feature measured in huge units, then run KNN with and without scaling.

The unscaled model is barely better than guessing: the noise feature's enormous range drowns out the perfectly predictive feature, so "nearest" neighbors are chosen almost entirely by noise. After StandardScaler puts both features on comparable footing (mean 0, standard deviation 1), the useful feature can actually influence which neighbors are nearest, and accuracy leaps. Same data, same algorithm, same k — the only difference is scaling.

Always scale features before KNN

For any distance-based model — KNN, K-means, and others — standardize your features first, or the model silently weights them by their arbitrary units. A feature measured in dollars will crush a feature measured in fractions, not because it matters more, but because its numbers are bigger. This is the most common KNN mistake by far.

The right way: scaling inside a Pipeline

Scaling manually is error-prone — it is easy to accidentally fit the scaler on the whole dataset (a data leak) or forget to apply the same transform to the test set. The clean solution is a Pipeline, which chains the scaler and the model into one object that does the right thing automatically. Pipelines get a full chapter; here is the pattern so you see how naturally it fits KNN.

The improvement is dramatic, and the pipeline guarantees the scaler is fit on the training fold only and reused for the test fold — exactly the leak-free habit the train/test split page warned you to protect. From here on, treat "KNN" and "KNN inside a scaling pipeline" as the same thing.

KNN for regression

Swap the vote for an average and you have KNN regression. To predict a number for a new point, find its k nearest neighbors and return the mean of their target values. The same scaling rules apply, and the same k tradeoff governs smoothness.

The picture tells the whole bias-variance story in one frame. With k=1, the prediction line zig-zags through every single point — it interpolates the noise (high variance). With k=25, the line is so smooth it flattens out the genuine curve and pulls toward the global average (high bias). The middle value, k=5, tracks the real sine shape without chasing the noise.

The same picture you saw for trees and k

Small k here plays the role that large max_depth played for trees: maximum flexibility, maximum overfitting. Large k plays the role of a stump: maximum smoothing, maximum underfitting. Different algorithms, same universal tradeoff. Once you recognize this pattern, every model's main knob becomes easier to reason about.

The curse of dimensionality

KNN has a subtler weakness that grows with the number of features: the curse of dimensionality. In high dimensions, something deeply unintuitive happens — all points become roughly equally far from one another. The distance to your nearest neighbor and the distance to your farthest neighbor converge, so "nearest" stops carrying much meaning, and KNN's core assumption quietly breaks down.

In 2 dimensions the farthest point is several times farther than the nearest — neighbors are meaningfully different distances away. By 1000 dimensions the ratio collapses toward 1: everything is about the same distance from everything else. When that happens, the "nearest" neighbors are barely nearer than random points, and KNN's predictions degrade. The practical fixes are to reduce dimensionality (select informative features, or use techniques covered later) and to scale, but the deeper lesson is that distance-based reasoning does not transfer for free to very high-dimensional data.

When to use KNN — and when not to

KNN shines when:

You want a dead-simple, assumption-light baseline. It makes no assumption about the shape of the decision boundary, so it can fit irregular regions that a linear model cannot.
The dataset is small-to-moderate and low-dimensional, with meaningfully scaled features.
Local structure is the signal. Recommendation-style "find me similar items" problems map naturally onto nearest-neighbor search.

Avoid KNN when:

The dataset is large. Because prediction compares each query to every stored point, predicting is slow and memory-hungry as data grows. Training is instant, but that is the wrong place to save time.
There are many features. The curse of dimensionality erodes the meaning of distance; trees and linear models cope far better.
You forgot to scale. An unscaled KNN is often worse than a trivial baseline, as the experiment above showed.

Common misconceptions about KNN

"KNN does no work, so it must be cheap." It is cheap to train and expensive to predict — exactly backwards from most models. The data also has to live in memory.
"The 'K' in KNN is the same as the 'K' in K-means." Different ks entirely. In KNN, k is the number of neighbors that vote. In K-means (clustering, a later chapter), k is the number of clusters. They share a letter, nothing more.
"Scaling is optional." For a distance-based model it is essential, not cosmetic.
"Bigger k is safer." Too-large k underfits by washing out local structure, just as too-small k overfits. There is a sweet spot.

Real-world applications

The nearest-neighbor idea powers more systems than its simplicity might suggest:

Recommendation and similarity search. "Customers similar to you also bought..." and "items like this one" are nearest-neighbor queries at heart (modern systems use fast approximate variants over learned embeddings).
Image and document retrieval. Find the most visually or semantically similar items by nearest neighbors in a feature space.
Anomaly detection. A point with no close neighbors is, by definition, unusual — a natural outlier flag.
Quick baselines and imputation. KNN is a sensible first model to beat, and KNNImputer fills missing values from similar rows.

The reason "find the most similar examples" endures is that it needs almost no assumptions. When you do not know the shape of the relationship, letting the closest data speak is a sturdy first move.

Your turn

The wine dataset has features on wildly different scales, so unscaled KNN does poorly — and scaling fixes it. Show both.

Load the data with load_wine(return_X_y=True) into X and y.
Split into train/test with 30% in the test set, random_state=0, and stratified on y.
Train a plain KNeighborsClassifier(n_neighbors=5) on the raw training data and store its test accuracy in raw_acc.
Build a pipeline with make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)), fit it on the training data, and store its test accuracy in scaled_acc. Name the pipeline pipe.

The hidden tests check that pipe really contains a StandardScaler, that the scaled accuracy is high (above 0.92), and — the key lesson — that scaling improved the test accuracy (scaled_acc > raw_acc).

Check your understanding

QuestionSelect one

Why is K-nearest neighbors described as a "lazy" or "instance-based" learner?

It always uses very few features to save effort

It does essentially no work during fit — it just stores the training data — and defers all computation to prediction time, where it searches for the nearest examples

It refuses to train on datasets larger than a few hundred rows

It picks a random subset of neighbors to avoid computation

QuestionSelect one

You run KNN with n_neighbors=1 and get 100% accuracy on the training set but mediocre accuracy on the test set. Why is the perfect training score unsurprising?

The training labels must be leaking into the model

With k=1, the single nearest point to any training example is the example itself, so it always votes for the correct label — training accuracy is trivially perfect and tells you nothing about generalization

A bug in scikit-learn inflates k=1 scores

It proves k=1 is the optimal choice

QuestionSelect one

A dataset has one feature measured in dollars (range 0 to 100000) and one measured as a fraction (range 0 to 1). You run KNN without scaling. What goes wrong?

KNN cannot run on features with different units and will raise an error

Nothing — KNN automatically normalizes features internally

Euclidean distance is dominated by the dollars feature because its numbers are far larger, so the fraction feature is effectively ignored regardless of how informative it is

The fraction feature dominates because smaller numbers count more

QuestionSelect one

How does increasing k (the number of neighbors) generally affect the bias-variance tradeoff in KNN?

Larger k increases variance and produces a more jagged boundary

Larger k has no effect on bias or variance

Larger k averages over more neighbors, which smooths the decision boundary — reducing variance but increasing bias (eventually underfitting)

Larger k always improves test accuracy without limit

QuestionSelect one

What is the "curse of dimensionality" as it relates to KNN?

KNN cannot accept more than a fixed number of features

Adding features always makes KNN faster but less accurate

As the number of features grows large, distances between points become nearly uniform — the nearest and farthest neighbors are almost equally far away — so "nearest" loses meaning and KNN's core assumption weakens

It refers to KNN using too much memory on small datasets

K-Nearest Neighbors

On this page