K-Nearest Neighbors
The laziest algorithm in machine learning — it does no real training at all. To predict, it simply looks at the closest examples it has seen and takes a vote. Simple, intuitive, and a perfect lens for understanding distance and feature scaling.
Most models work hard up front. A linear regression solves for coefficients; a decision tree searches for the best splits. By the time training finishes, the original data can be thrown away — the model is the learned formula or flowchart.
K-nearest neighbors (KNN) does the opposite, and it is almost comically
simple. To predict the label of a new point, it asks one question: which
training examples are closest to you? It finds the k nearest ones and
lets them vote. For classification, the majority class wins. For
regression, it averages their values. That is the entire algorithm. There
is no equation to fit, no tree to grow — KNN just memorizes the training
set and consults it at prediction time.
This makes KNN the canonical example of instance-based or lazy learning, and it is one of the best teaching tools in all of machine learning. It forces you to confront two ideas you will use forever: what it means for examples to be "close," and why feature scaling can make or break a distance-based model.
The intuition: you are who you sit next to
The premise is something you already believe about the world. If you want to guess whether a fruit is an apple or an orange, and the three fruits most similar to it in size, color, and weight are all apples, you would guess apple. If you want to estimate a house's price and the five most similar houses nearby sold for around 400,000, you would guess about 400,000. KNN formalizes exactly this instinct: similar inputs tend to have similar outputs.
The diagram shows a single prediction. The new point compares itself to
every stored example, keeps the five nearest, and reports the majority
label among them. Change k and you change how many neighbors get a say.
Why 'lazy' is a technical term, not an insult
KNN is called a lazy learner because it defers all the work to prediction
time. fit does almost nothing — it just stores the data (and builds an
index to find neighbors quickly). Compare this to a lazy student who
crams during the exam rather than studying beforehand. The cost is not
free: a lazy learner can be slow and memory-hungry when you actually ask it
to predict, especially on large datasets.
What does "closest" mean? Distance
"Nearest" requires a notion of distance. By default KNN uses Euclidean distance — the straight-line distance you would measure with a ruler, the Pythagorean theorem extended to as many dimensions as you have features. Two examples are close if the sum of their squared differences across all features is small.
This is intuitive, but it hides a trap that we will spend most of this page on. Euclidean distance adds up differences across features as if every feature were measured in the same units. It almost never is — and that is where scaling comes in.
Your first KNN classifier
Let us classify iris flowers. The API is, as always, identical to every other scikit-learn estimator.
Notice that fit returns instantly: there is nothing to optimize. All the
work happens inside score (and predict), where each test flower searches
the training set for its five nearest neighbors and takes their majority
vote. On the well-separated iris data, this naive strategy is remarkably
accurate.
The dial that controls everything: choosing k
k (the argument n_neighbors) is KNN's single most important knob, and
it directly controls the bias-variance tradeoff you met earlier.
- Small
k(e.g. 1). Each prediction is decided by a single closest neighbor. The decision boundary becomes jagged and wraps tightly around individual points — including noisy ones. This is low bias, high variance: the model is sensitive to every quirk in the data and tends to overfit. - Large
k(e.g. 50). Each prediction is an average over many neighbors, which smooths the boundary. Too large, and predictions blur toward the overall majority class, ignoring local structure. This is high bias, low variance: the model underfits.
Let us watch test accuracy respond to k.
Look at k=1: the training accuracy is a perfect 1.000. That is not
skill — with one neighbor, the closest point to any training example is
itself, so it always votes correctly. The test score is the honest one,
and it is lower. As k grows, training accuracy comes down to earth while
test accuracy rises to a plateau, then eventually falls as the model
over-smooths. The best k lives in the middle, and you find it properly
with cross-validation (its own chapter), not by peeking at the test set.
A useful default and an odd-number trick
A common starting point is k around the square root of the number of
training samples, then refine with cross-validation. For binary
classification, prefer an odd k so a two-class vote cannot tie. Ties
are broken arbitrarily otherwise, which adds noise you do not want.
The make-or-break issue: feature scaling
Here is the single most important thing to know about KNN, and the reason it is taught so early: KNN is distance-based, so the scale of your features matters enormously. Euclidean distance sums squared differences across features. If one feature ranges over 0–1000 and another over 0–1, the large-range feature utterly dominates the distance, and the small one is ignored — no matter how informative it actually is.
Let us prove it. We will build a dataset with one genuinely useful feature measured in small units and one pure-noise feature measured in huge units, then run KNN with and without scaling.
The unscaled model is barely better than guessing: the noise feature's
enormous range drowns out the perfectly predictive feature, so "nearest"
neighbors are chosen almost entirely by noise. After
StandardScaler puts both features on comparable footing (mean 0, standard
deviation 1), the useful feature can actually influence which neighbors are
nearest, and accuracy leaps. Same data, same algorithm, same k — the
only difference is scaling.
Always scale features before KNN
For any distance-based model — KNN, K-means, and others — standardize your features first, or the model silently weights them by their arbitrary units. A feature measured in dollars will crush a feature measured in fractions, not because it matters more, but because its numbers are bigger. This is the most common KNN mistake by far.
The right way: scaling inside a Pipeline
Scaling manually is error-prone — it is easy to accidentally fit the scaler on the whole dataset (a data leak) or forget to apply the same transform to the test set. The clean solution is a Pipeline, which chains the scaler and the model into one object that does the right thing automatically. Pipelines get a full chapter; here is the pattern so you see how naturally it fits KNN.
The improvement is dramatic, and the pipeline guarantees the scaler is fit on the training fold only and reused for the test fold — exactly the leak-free habit the train/test split page warned you to protect. From here on, treat "KNN" and "KNN inside a scaling pipeline" as the same thing.
KNN for regression
Swap the vote for an average and you have KNN regression. To predict a
number for a new point, find its k nearest neighbors and return the mean
of their target values. The same scaling rules apply, and the same k
tradeoff governs smoothness.
The picture tells the whole bias-variance story in one frame. With k=1,
the prediction line zig-zags through every single point — it interpolates
the noise (high variance). With k=25, the line is so smooth it flattens
out the genuine curve and pulls toward the global average (high bias). The
middle value, k=5, tracks the real sine shape without chasing the noise.
The same picture you saw for trees and k
Small k here plays the role that large max_depth played for trees:
maximum flexibility, maximum overfitting. Large k plays the role of a
stump: maximum smoothing, maximum underfitting. Different algorithms, same
universal tradeoff. Once you recognize this pattern, every model's main
knob becomes easier to reason about.
The curse of dimensionality
KNN has a subtler weakness that grows with the number of features: the curse of dimensionality. In high dimensions, something deeply unintuitive happens — all points become roughly equally far from one another. The distance to your nearest neighbor and the distance to your farthest neighbor converge, so "nearest" stops carrying much meaning, and KNN's core assumption quietly breaks down.
In 2 dimensions the farthest point is several times farther than the nearest — neighbors are meaningfully different distances away. By 1000 dimensions the ratio collapses toward 1: everything is about the same distance from everything else. When that happens, the "nearest" neighbors are barely nearer than random points, and KNN's predictions degrade. The practical fixes are to reduce dimensionality (select informative features, or use techniques covered later) and to scale, but the deeper lesson is that distance-based reasoning does not transfer for free to very high-dimensional data.
When to use KNN — and when not to
KNN shines when:
- You want a dead-simple, assumption-light baseline. It makes no assumption about the shape of the decision boundary, so it can fit irregular regions that a linear model cannot.
- The dataset is small-to-moderate and low-dimensional, with meaningfully scaled features.
- Local structure is the signal. Recommendation-style "find me similar items" problems map naturally onto nearest-neighbor search.
Avoid KNN when:
- The dataset is large. Because prediction compares each query to every stored point, predicting is slow and memory-hungry as data grows. Training is instant, but that is the wrong place to save time.
- There are many features. The curse of dimensionality erodes the meaning of distance; trees and linear models cope far better.
- You forgot to scale. An unscaled KNN is often worse than a trivial baseline, as the experiment above showed.
Common misconceptions about KNN
- "KNN does no work, so it must be cheap." It is cheap to train and expensive to predict — exactly backwards from most models. The data also has to live in memory.
- "The 'K' in KNN is the same as the 'K' in K-means." Different
ks entirely. In KNN,kis the number of neighbors that vote. In K-means (clustering, a later chapter),kis the number of clusters. They share a letter, nothing more. - "Scaling is optional." For a distance-based model it is essential, not cosmetic.
- "Bigger k is safer." Too-large
kunderfits by washing out local structure, just as too-smallkoverfits. There is a sweet spot.
Real-world applications
The nearest-neighbor idea powers more systems than its simplicity might suggest:
- Recommendation and similarity search. "Customers similar to you also bought..." and "items like this one" are nearest-neighbor queries at heart (modern systems use fast approximate variants over learned embeddings).
- Image and document retrieval. Find the most visually or semantically similar items by nearest neighbors in a feature space.
- Anomaly detection. A point with no close neighbors is, by definition, unusual — a natural outlier flag.
- Quick baselines and imputation. KNN is a sensible first model to beat,
and
KNNImputerfills missing values from similar rows.
The reason "find the most similar examples" endures is that it needs almost no assumptions. When you do not know the shape of the relationship, letting the closest data speak is a sturdy first move.
Your turn
The wine dataset has features on wildly different scales, so unscaled KNN does poorly — and scaling fixes it. Show both.
- Load the data with
load_wine(return_X_y=True)intoXandy. - Split into train/test with 30% in the test set,
random_state=0, and stratified ony. - Train a plain
KNeighborsClassifier(n_neighbors=5)on the raw training data and store its test accuracy inraw_acc. - Build a pipeline with
make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)), fit it on the training data, and store its test accuracy inscaled_acc. Name the pipelinepipe.
The hidden tests check that pipe really contains a StandardScaler,
that the scaled accuracy is high (above 0.92), and — the key lesson — that
scaling improved the test accuracy (scaled_acc > raw_acc).
Check your understanding
Why is K-nearest neighbors described as a "lazy" or "instance-based" learner?
It always uses very few features to save effort
It does essentially no work during fit — it just stores the training data — and defers all computation to prediction time, where it searches for the nearest examples
It refuses to train on datasets larger than a few hundred rows
It picks a random subset of neighbors to avoid computation
You run KNN with n_neighbors=1 and get 100% accuracy on the training set but mediocre accuracy on the test set. Why is the perfect training score unsurprising?
The training labels must be leaking into the model
With k=1, the single nearest point to any training example is the example itself, so it always votes for the correct label — training accuracy is trivially perfect and tells you nothing about generalization
A bug in scikit-learn inflates k=1 scores
It proves k=1 is the optimal choice
A dataset has one feature measured in dollars (range 0 to 100000) and one measured as a fraction (range 0 to 1). You run KNN without scaling. What goes wrong?
KNN cannot run on features with different units and will raise an error
Nothing — KNN automatically normalizes features internally
Euclidean distance is dominated by the dollars feature because its numbers are far larger, so the fraction feature is effectively ignored regardless of how informative it is
The fraction feature dominates because smaller numbers count more
How does increasing k (the number of neighbors) generally affect the bias-variance tradeoff in KNN?
Larger k increases variance and produces a more jagged boundary
Larger k has no effect on bias or variance
Larger k averages over more neighbors, which smooths the decision boundary — reducing variance but increasing bias (eventually underfitting)
Larger k always improves test accuracy without limit
What is the "curse of dimensionality" as it relates to KNN?
KNN cannot accept more than a fixed number of features
Adding features always makes KNN faster but less accurate
As the number of features grows large, distances between points become nearly uniform — the nearest and farthest neighbors are almost equally far away — so "nearest" loses meaning and KNN's core assumption weakens
It refers to KNN using too much memory on small datasets
Decision Trees
A model you can read like a flowchart — a cascade of yes/no questions that splits the data into ever-purer groups. Intuitive, requires no scaling, and the gateway to random forests.
Ensembles and Random Forests
The wisdom of crowds, applied to models. One decision tree is clever but unstable; average hundreds of diverse trees and you get one of the most reliable, hardest-to-beat models in all of tabular machine learning.