K-Means Clustering
The flagship unsupervised algorithm — how a simple loop of "assign points, move centers, repeat" discovers groups in unlabeled data, when it works beautifully, and when it quietly lies to you.
So far every model in this course has had a teacher. We handed it X and
y, and it learned to map one to the other. But a huge amount of real data
arrives with no labels at all: a table of customers and nothing telling you
which "type" each one is, a pile of sensor readings with no fault tags, a
set of documents nobody has sorted. There is no y. There is only X.
K-means clustering is the first and most famous tool for that
situation. Give it a cloud of points and a number k, and it carves the
cloud into k groups so that points in the same group are close together
and points in different groups are far apart. Nobody told it what the
groups mean — it just finds structure that was already sitting in the
data, waiting to be noticed.
This is your entry point into unsupervised learning, and the ideas here — distance, centers, "similar things belong together" — echo through everything that follows.
Unsupervised: there is no answer key
In supervised learning you can check your work: predict, compare to the true label, measure the error. Clustering has no labels to check against. That freedom is powerful but unnerving — there is no single "right" answer, and judging a clustering takes more care than judging a classifier. We build that judgment here and sharpen it on the Evaluating Clusters page.
The problem it solves
Suppose a streaming service has a million listeners and wants to design, say, five playlists that each appeal to a coherent group. Nobody has labeled the listeners as "jazz fan" or "workout-music person." All you have is behavior: how much each person listens to each genre, time of day, session length. You suspect there are natural types of listener in there — but which ones, and who belongs to which?
This is the core unsupervised question: given unlabeled examples, can we group them so that members of a group resemble each other? That grouping is called a cluster, and finding clusters is cluster analysis.
The applications are everywhere:
- Customer segmentation. Group shoppers by behavior to tailor offers.
- Image compression. Replace millions of pixel colors with
krepresentative colors (each pixel snaps to its cluster's color). - Anomaly hints. Points that sit far from every cluster center are candidates for "weird, look closer."
- Document and gene grouping. Bucket articles by topic or genes by expression pattern without anyone labeling them first.
K-means is the workhorse for all of these because it is fast, simple, and scales to large datasets — qualities we will pay for with some strong assumptions later.
The intuition: centers and the points that belong to them
K-means is built on one idea repeated until it settles. A cluster is
represented by a single point called its centroid (think "the center of
mass of the cluster"). A data point belongs to whichever centroid is
nearest. So the whole model is just k centers, and the rule "you belong
to your closest center."
The clever part is how it finds good centers, because at the start it has no idea where they should go. It uses a loop:
- Pick
kstarting centers. Often justkrandom points to begin. - Assign step. Label every data point with its nearest center.
- Update step. Move each center to the mean of all points currently assigned to it.
- Repeat steps 2 and 3 until the assignments stop changing.
That is the entire algorithm. Each round, points get reassigned to whatever center is now closest, and each center drifts toward the middle of its flock. The two steps chase each other until they reach an agreement nobody wants to leave — a stable configuration where no point would rather switch and no center wants to move.
Why moving to the mean is the right move
The name "k-means" is literal: each center is the mean of its assigned points. Why the mean? Because the mean is exactly the point that minimizes the total squared distance to a set of points. So every update step is guaranteed to reduce (or hold) the total spread of points around their centers. That is why the loop always settles — it can only make things tighter, never looser.
Watching it work on clean data
Let us generate four blobs of points and let k-means rediscover them. The
make_blobs generator gives us labeled blobs, but we will throw the
labels away and pretend we only have the raw points — because that is the
real clustering situation.
fit_predict does two things at once: it runs the loop to find the centers
(fit) and then returns the cluster label for every point (predict).
Those labels are just arbitrary integers 0, 1, 2, 3 — they are group
names, not a ranking. Cluster 0 is not "better" than cluster 3; they
are simply different buckets.
Cluster numbers are arbitrary and unstable
The integer labels k-means assigns carry no meaning and can change between runs or seeds. The cluster that is "0" today might be "2" tomorrow. Never treat the cluster number as ordered or comparable across runs. What matters is which points are grouped together, not the digit on the bucket.
Seeing the clusters
A scatter plot makes it obvious. We color each point by its assigned cluster and mark the centers with big stars.
The colors line up with the blobs your eye already sees, and each red star sits right in the heart of its group. K-means recovered the structure without ever being told it existed. This is the picture-perfect case: round, well-separated, similarly-sized blobs. Keep that picture in mind, because the rest of this page is largely about what happens when reality does not look like this.
Why n_init exists: the starting-point lottery
Step 1 of the loop is "pick k starting centers," and that choice
matters more than you might expect. The loop is guaranteed to settle, but
it can settle into a bad arrangement if it started in an unlucky place —
two centers landing in the same blob while a real blob gets none. The
algorithm finds a local optimum, not necessarily the global best.
The standard defense is to run the whole thing several times from
different random starts and keep the best result. That is exactly what
n_init=10 does: ten independent runs, and k-means returns the one with
the tightest clusters. "Tightest" is measured by inertia — the total
squared distance from every point to its own center. Lower inertia means
points hug their centers more closely.
With n_init=10 the result is rock-steady across seeds; with n_init=1 it
wobbles, and sometimes lands on a noticeably worse (higher-inertia)
solution. This is why the spec — and good practice — always sets
n_init=10 (or more). It is cheap insurance against the starting-point
lottery.
k-means++ is doing quiet work for you
By default scikit-learn does not place the initial centers purely at
random; it uses an initialization called k-means++ that spreads the
starting centers out, making a good outcome far more likely. Combined with
n_init, this is why modern k-means rarely gets stuck. You do not need to
configure it — it is the default — but it is worth knowing your tool is
already helping you here.
Inertia, and the hard question of choosing k
You may have noticed the catch already: k-means cannot tell you how many
clusters there are. You hand it k. If you ask for 4 clusters in data
that really has 3, it will happily split one real group in two and report
back with a straight face. Choosing k is the central practical struggle
of k-means.
A natural instinct is "just minimize inertia." But that instinct fails,
and understanding why is illuminating. Inertia always drops as k grows —
more centers means every point can be closer to one. Push it to the
extreme: if k equals the number of points, every point is its own center
and inertia is exactly zero. That clustering is useless. So you cannot pick
k by minimizing inertia; the minimum is always "give every point its own
cluster."
The elbow method
Instead of chasing the minimum, we look at the shape of the inertia
curve as k increases. Adding clusters helps a lot at first — splitting a
big mixed blob into real groups slashes inertia. But once you have captured
the genuine groups, extra clusters only carve up already-tight groups, and
inertia barely improves. The curve bends sharply at the "right" k and
then flattens. That bend is the elbow.
Read the numbers and the plot together. Inertia plummets from k=1 to
k=4 as real blobs get their own centers, then the drops shrink to almost
nothing. The curve looks like an arm bent at the elbow, and the elbow sits
right at k=4 — the true number of blobs. You did not need labels to find
it; the geometry told you.
The elbow is a judgment call, not a formula
On clean synthetic data the elbow is obvious. On real data it is often a
gentle bend with no crisp corner, and reasonable people will disagree about
where it is. The elbow is a guide, not a decision procedure. Pair it with
the silhouette score (covered in depth on the Evaluating Clusters
page) and, above all, with domain knowledge: if the business needs five
customer segments, k=5 may beat whatever the curve whispers.
The silhouette score is a second, complementary way to choose k that
actually measures how well-separated the clusters are. We deliberately do
not open that box here — it gets its own careful treatment on the
Evaluating Clusters page. For now, know it exists and that the elbow is
your first, quickest sanity check.
You must standardize first
K-means lives and dies by distance, and distance is sensitive to the
scale of your features. Suppose you cluster customers using age (range
roughly 18–80) and annual_income (range roughly 20,000–200,000). Income
spans tens of thousands; age spans tens. When k-means measures distance,
the income differences are thousands of times larger, so they completely
dominate. The age feature might as well not exist. K-means would cluster
purely on income and call it a day.
The fix is to put every feature on a comparable scale before clustering,
almost always with StandardScaler, which rescales each feature to mean 0
and standard deviation 1. (You met scaling on the Data Preprocessing
page; here is why it is non-negotiable for distance-based methods.)
On the raw features the clusters ignore age (large within-cluster age spread) because income drowns it out. After standardizing, age gets a real say and the clusters tighten on both features. For any distance-based method — k-means, k-nearest neighbors, hierarchical clustering — scale your features first unless you have a deliberate reason not to.
Forgetting to scale is the #1 clustering bug
The most common way a real-world clustering goes silently wrong is feeding in unscaled features. The algorithm runs without error, returns clusters, and they are dominated by whichever feature happens to have the biggest numbers — often something arbitrary like a dollar amount or a count. Always standardize before clustering on distances, and be suspicious of any clustering where you did not.
Where k-means breaks: non-spherical shapes
Here is the honest, load-bearing limitation. K-means assigns each point to its nearest center, which means every cluster is effectively a ball around its center — a roughly round, convex region. When the real groups are round-ish blobs, perfect. But when the true shape is a crescent, a ring, or a long thin band, "nearest center" is the wrong rule, and k-means fails in a way that looks confident and is completely wrong.
The classic demonstration is make_moons: two interleaving crescents that
any human eye separates instantly. Watch k-means butcher them.
K-means slices the two moons with a roughly straight boundary, splitting each crescent down the middle and mixing the two real groups together. It cannot do better, because no placement of two centers makes "nearest center" trace a crescent. The algorithm did not malfunction — it did exactly what it always does, and what it does is wrong for this shape.
K-means assumes round, similarly-sized clusters
K-means implicitly assumes clusters are convex, roughly spherical, and similar in size and density. When that holds, it is fast and excellent. When it does not — crescents, rings, elongated streaks, or one giant cluster beside a tiny one — k-means produces confident nonsense. This is not a bug to fix with more iterations; it is built into the "nearest center" rule. For such shapes, reach for density- or connectivity-based methods (DBSCAN, or the Hierarchical Clustering page next).
This is why we make you see the failure. A model that fails loudly is safe; one that fails silently while returning plausible output is dangerous. K-means on the moons fails silently — the output is two clusters, they just happen to be the wrong two. Building the instinct to ask "is this shape even appropriate for k-means?" is worth more than any amount of API fluency.
When to use k-means — and when not to
Reach for k-means when:
- You expect roughly round, comparably-sized groups.
- You have a sensible guess for
k, or you can find one with the elbow and silhouette. - The dataset is large and you need something fast — k-means scales to millions of points where many alternatives choke.
- You want representative "prototype" points (the centroids) for each group, e.g. for compression or summarization.
Avoid k-means when:
- Clusters are non-spherical, nested, or wildly different in size or density (use DBSCAN or hierarchical methods).
- You genuinely have no idea about
kand no way to pick it — k-means forces a choice. - Your features are mostly categorical. Means and Euclidean distance are built for numeric data; one-hot then clustering on distances is shaky. (K-modes and other tools exist for categorical data.)
- A few extreme outliers are present. Because centers are means, a single far-flung point can drag a centroid badly off course.
Real-world reality check
In industry, k-means is often the first thing tried on an unlabeled dataset precisely because it is fast and simple — a quick way to ask "is there any obvious structure here?" The danger is stopping there. A responsible workflow tries k-means, looks at the clusters (plot them, inspect a few members), checks whether the spherical assumption is plausible, and only then trusts the result or switches methods.
Common misconceptions
- "K-means tells you how many clusters there are." It does not. You
supply
k. The elbow and silhouette help you choose it, but the algorithm itself takes the count as input. - "Lower inertia always means a better clustering." Only at a fixed
k. Across differentk, inertia always falls askrises, bottoming out at the useless "every point is its own cluster." Never compare inertia across differentkto pick the winner. - "The cluster numbers mean something." They are arbitrary labels. Cluster 0 is not first or best, and the numbering can flip between runs.
- "K-means found the true clusters." It found a grouping that minimizes within-cluster spread under its spherical assumption. Whether that matches reality depends entirely on whether the assumption fits.
- "Running it once is enough." A single random start can land in a poor
local optimum.
n_init(multiple restarts) is there for a reason — keep it at 10 or more. - "Scaling is optional." For a distance-based method it is essential. Skipping it lets the largest-magnitude feature hijack the result.
Your turn
You are given X: 300 unlabeled points that form 3 blobs
(the true labels are hidden from you, as in real clustering).
- Standardize
XwithStandardScalerintoX_scaled— distance-based methods need features on a comparable scale. - Fit
KMeanswithn_clusters=3,n_init=10,random_state=0onX_scaled. Store the per-point cluster labels inlabels(usefit_predict). - Store the number of distinct clusters found in
n_clusters_found. - Store the size of each cluster (how many points it contains) as a NumPy
array
cluster_sizes, sorted ascending. Hint:np.unique(labels, return_counts=True)then sort the counts.
The hidden tests check that you scaled the data, found exactly 3 clusters, assigned all 300 points, and that the cluster sizes sum to 300.
Check your understanding
In each iteration of k-means, what happens during the update step?
Each point is reassigned to its nearest center
Each center is moved to the mean (average position) of the points currently assigned to it
A new random center is added to split the largest cluster
The number of clusters k is recalculated from the data
Why does KMeans use n_init=10 by recommendation rather than running just once?
It makes the clusters perfectly spherical
It is required to compute inertia
A single random start can settle into a poor local optimum, so k-means runs several times from different starts and keeps the lowest-inertia result
It automatically determines the correct number of clusters
You compute inertia for k = 1, 2, ..., 10 and notice it keeps decreasing as k grows. Can you choose the best k by simply picking the one with the lowest inertia?
Yes — lowest inertia is always the best clustering
No — inertia always decreases as k increases (reaching zero when every point is its own cluster), so you look for the "elbow" where the decrease sharply slows instead
No — inertia is meaningless and should never be used
Yes — but only if you also standardize the features
You run k-means on make_moons (two interleaving crescents) with k=2 and the result splits each crescent in half rather than separating the two moons. What is the best explanation?
The data needs more samples
You forgot to set random_state
K-means assigns points to the nearest center, so its clusters are effectively round regions — it cannot capture crescent shapes no matter how it is tuned
The inertia was too low
You cluster customers using age (roughly 18–80) and annual_income (roughly 20,000–200,000) without scaling. What is the likely consequence?
Age will dominate because it has fewer digits
The clustering will be unaffected by feature scale
Income will dominate the distance calculations because its values are far larger, so the clusters form almost entirely along income and age is effectively ignored
K-means will raise an error about unscaled features
Feature Engineering
A model can only learn from patterns that are visible in the features you give it. Reshaping raw columns into the right representation often beats any fancier algorithm.
Hierarchical Clustering
Building a whole family tree of clusters from the bottom up — how agglomerative merging works, how to read a dendrogram, and why "you do not have to commit to k upfront" is both its superpower and its cost.