K-Means Clustering

The flagship unsupervised algorithm — how a simple loop of "assign points, move centers, repeat" discovers groups in unlabeled data, when it works beautifully, and when it quietly lies to you.

So far every model in this course has had a teacher. We handed it X and y, and it learned to map one to the other. But a huge amount of real data arrives with no labels at all: a table of customers and nothing telling you which "type" each one is, a pile of sensor readings with no fault tags, a set of documents nobody has sorted. There is no y. There is only X.

K-means clustering is the first and most famous tool for that situation. Give it a cloud of points and a number k, and it carves the cloud into k groups so that points in the same group are close together and points in different groups are far apart. Nobody told it what the groups mean — it just finds structure that was already sitting in the data, waiting to be noticed.

This is your entry point into unsupervised learning, and the ideas here — distance, centers, "similar things belong together" — echo through everything that follows.

Unsupervised: there is no answer key

In supervised learning you can check your work: predict, compare to the true label, measure the error. Clustering has no labels to check against. That freedom is powerful but unnerving — there is no single "right" answer, and judging a clustering takes more care than judging a classifier. We build that judgment here and sharpen it on the Evaluating Clusters page.

The problem it solves

Suppose a streaming service has a million listeners and wants to design, say, five playlists that each appeal to a coherent group. Nobody has labeled the listeners as "jazz fan" or "workout-music person." All you have is behavior: how much each person listens to each genre, time of day, session length. You suspect there are natural types of listener in there — but which ones, and who belongs to which?

This is the core unsupervised question: given unlabeled examples, can we group them so that members of a group resemble each other? That grouping is called a cluster, and finding clusters is cluster analysis.

The applications are everywhere:

Customer segmentation. Group shoppers by behavior to tailor offers.
Image compression. Replace millions of pixel colors with k representative colors (each pixel snaps to its cluster's color).
Anomaly hints. Points that sit far from every cluster center are candidates for "weird, look closer."
Document and gene grouping. Bucket articles by topic or genes by expression pattern without anyone labeling them first.

K-means is the workhorse for all of these because it is fast, simple, and scales to large datasets — qualities we will pay for with some strong assumptions later.

The intuition: centers and the points that belong to them

K-means is built on one idea repeated until it settles. A cluster is represented by a single point called its centroid (think "the center of mass of the cluster"). A data point belongs to whichever centroid is nearest. So the whole model is just k centers, and the rule "you belong to your closest center."

The clever part is how it finds good centers, because at the start it has no idea where they should go. It uses a loop:

Pick k starting centers. Often just k random points to begin.
Assign step. Label every data point with its nearest center.
Update step. Move each center to the mean of all points currently assigned to it.
Repeat steps 2 and 3 until the assignments stop changing.

That is the entire algorithm. Each round, points get reassigned to whatever center is now closest, and each center drifts toward the middle of its flock. The two steps chase each other until they reach an agreement nobody wants to leave — a stable configuration where no point would rather switch and no center wants to move.

Why moving to the mean is the right move

The name "k-means" is literal: each center is the mean of its assigned points. Why the mean? Because the mean is exactly the point that minimizes the total squared distance to a set of points. So every update step is guaranteed to reduce (or hold) the total spread of points around their centers. That is why the loop always settles — it can only make things tighter, never looser.

Watching it work on clean data

Let us generate four blobs of points and let k-means rediscover them. The make_blobs generator gives us labeled blobs, but we will throw the labels away and pretend we only have the raw points — because that is the real clustering situation.

fit_predict does two things at once: it runs the loop to find the centers (fit) and then returns the cluster label for every point (predict). Those labels are just arbitrary integers 0, 1, 2, 3 — they are group names, not a ranking. Cluster 0 is not "better" than cluster 3; they are simply different buckets.

Cluster numbers are arbitrary and unstable

The integer labels k-means assigns carry no meaning and can change between runs or seeds. The cluster that is "0" today might be "2" tomorrow. Never treat the cluster number as ordered or comparable across runs. What matters is which points are grouped together, not the digit on the bucket.

Seeing the clusters

A scatter plot makes it obvious. We color each point by its assigned cluster and mark the centers with big stars.

The colors line up with the blobs your eye already sees, and each red star sits right in the heart of its group. K-means recovered the structure without ever being told it existed. This is the picture-perfect case: round, well-separated, similarly-sized blobs. Keep that picture in mind, because the rest of this page is largely about what happens when reality does not look like this.

Why `n_init` exists: the starting-point lottery

Step 1 of the loop is "pick k starting centers," and that choice matters more than you might expect. The loop is guaranteed to settle, but it can settle into a bad arrangement if it started in an unlucky place — two centers landing in the same blob while a real blob gets none. The algorithm finds a local optimum, not necessarily the global best.

The standard defense is to run the whole thing several times from different random starts and keep the best result. That is exactly what n_init=10 does: ten independent runs, and k-means returns the one with the tightest clusters. "Tightest" is measured by inertia — the total squared distance from every point to its own center. Lower inertia means points hug their centers more closely.

With n_init=10 the result is rock-steady across seeds; with n_init=1 it wobbles, and sometimes lands on a noticeably worse (higher-inertia) solution. This is why the spec — and good practice — always sets n_init=10 (or more). It is cheap insurance against the starting-point lottery.

k-means++ is doing quiet work for you

By default scikit-learn does not place the initial centers purely at random; it uses an initialization called k-means++ that spreads the starting centers out, making a good outcome far more likely. Combined with n_init, this is why modern k-means rarely gets stuck. You do not need to configure it — it is the default — but it is worth knowing your tool is already helping you here.

Inertia, and the hard question of choosing `k`

You may have noticed the catch already: k-means cannot tell you how many clusters there are. You hand it k. If you ask for 4 clusters in data that really has 3, it will happily split one real group in two and report back with a straight face. Choosing k is the central practical struggle of k-means.

A natural instinct is "just minimize inertia." But that instinct fails, and understanding why is illuminating. Inertia always drops as k grows — more centers means every point can be closer to one. Push it to the extreme: if k equals the number of points, every point is its own center and inertia is exactly zero. That clustering is useless. So you cannot pick k by minimizing inertia; the minimum is always "give every point its own cluster."

The elbow method

Instead of chasing the minimum, we look at the shape of the inertia curve as k increases. Adding clusters helps a lot at first — splitting a big mixed blob into real groups slashes inertia. But once you have captured the genuine groups, extra clusters only carve up already-tight groups, and inertia barely improves. The curve bends sharply at the "right" k and then flattens. That bend is the elbow.

Read the numbers and the plot together. Inertia plummets from k=1 to k=4 as real blobs get their own centers, then the drops shrink to almost nothing. The curve looks like an arm bent at the elbow, and the elbow sits right at k=4 — the true number of blobs. You did not need labels to find it; the geometry told you.

The elbow is a judgment call, not a formula

On clean synthetic data the elbow is obvious. On real data it is often a gentle bend with no crisp corner, and reasonable people will disagree about where it is. The elbow is a guide, not a decision procedure. Pair it with the silhouette score (covered in depth on the Evaluating Clusters page) and, above all, with domain knowledge: if the business needs five customer segments, k=5 may beat whatever the curve whispers.

The silhouette score is a second, complementary way to choose k that actually measures how well-separated the clusters are. We deliberately do not open that box here — it gets its own careful treatment on the Evaluating Clusters page. For now, know it exists and that the elbow is your first, quickest sanity check.

You must standardize first

K-means lives and dies by distance, and distance is sensitive to the scale of your features. Suppose you cluster customers using age (range roughly 18–80) and annual_income (range roughly 20,000–200,000). Income spans tens of thousands; age spans tens. When k-means measures distance, the income differences are thousands of times larger, so they completely dominate. The age feature might as well not exist. K-means would cluster purely on income and call it a day.

The fix is to put every feature on a comparable scale before clustering, almost always with StandardScaler, which rescales each feature to mean 0 and standard deviation 1. (You met scaling on the Data Preprocessing page; here is why it is non-negotiable for distance-based methods.)

On the raw features the clusters ignore age (large within-cluster age spread) because income drowns it out. After standardizing, age gets a real say and the clusters tighten on both features. For any distance-based method — k-means, k-nearest neighbors, hierarchical clustering — scale your features first unless you have a deliberate reason not to.

Forgetting to scale is the #1 clustering bug

The most common way a real-world clustering goes silently wrong is feeding in unscaled features. The algorithm runs without error, returns clusters, and they are dominated by whichever feature happens to have the biggest numbers — often something arbitrary like a dollar amount or a count. Always standardize before clustering on distances, and be suspicious of any clustering where you did not.

Where k-means breaks: non-spherical shapes

Here is the honest, load-bearing limitation. K-means assigns each point to its nearest center, which means every cluster is effectively a ball around its center — a roughly round, convex region. When the real groups are round-ish blobs, perfect. But when the true shape is a crescent, a ring, or a long thin band, "nearest center" is the wrong rule, and k-means fails in a way that looks confident and is completely wrong.

The classic demonstration is make_moons: two interleaving crescents that any human eye separates instantly. Watch k-means butcher them.

K-means slices the two moons with a roughly straight boundary, splitting each crescent down the middle and mixing the two real groups together. It cannot do better, because no placement of two centers makes "nearest center" trace a crescent. The algorithm did not malfunction — it did exactly what it always does, and what it does is wrong for this shape.

K-means assumes round, similarly-sized clusters

K-means implicitly assumes clusters are convex, roughly spherical, and similar in size and density. When that holds, it is fast and excellent. When it does not — crescents, rings, elongated streaks, or one giant cluster beside a tiny one — k-means produces confident nonsense. This is not a bug to fix with more iterations; it is built into the "nearest center" rule. For such shapes, reach for density- or connectivity-based methods (DBSCAN, or the Hierarchical Clustering page next).

This is why we make you see the failure. A model that fails loudly is safe; one that fails silently while returning plausible output is dangerous. K-means on the moons fails silently — the output is two clusters, they just happen to be the wrong two. Building the instinct to ask "is this shape even appropriate for k-means?" is worth more than any amount of API fluency.

When to use k-means — and when not to

Reach for k-means when:

You expect roughly round, comparably-sized groups.
You have a sensible guess for k, or you can find one with the elbow and silhouette.
The dataset is large and you need something fast — k-means scales to millions of points where many alternatives choke.
You want representative "prototype" points (the centroids) for each group, e.g. for compression or summarization.

Avoid k-means when:

Clusters are non-spherical, nested, or wildly different in size or density (use DBSCAN or hierarchical methods).
You genuinely have no idea about k and no way to pick it — k-means forces a choice.
Your features are mostly categorical. Means and Euclidean distance are built for numeric data; one-hot then clustering on distances is shaky. (K-modes and other tools exist for categorical data.)
A few extreme outliers are present. Because centers are means, a single far-flung point can drag a centroid badly off course.

Real-world reality check

In industry, k-means is often the first thing tried on an unlabeled dataset precisely because it is fast and simple — a quick way to ask "is there any obvious structure here?" The danger is stopping there. A responsible workflow tries k-means, looks at the clusters (plot them, inspect a few members), checks whether the spherical assumption is plausible, and only then trusts the result or switches methods.

Common misconceptions

"K-means tells you how many clusters there are." It does not. You supply k. The elbow and silhouette help you choose it, but the algorithm itself takes the count as input.
"Lower inertia always means a better clustering." Only at a fixed k. Across different k, inertia always falls as k rises, bottoming out at the useless "every point is its own cluster." Never compare inertia across different k to pick the winner.
"The cluster numbers mean something." They are arbitrary labels. Cluster 0 is not first or best, and the numbering can flip between runs.
"K-means found the true clusters." It found a grouping that minimizes within-cluster spread under its spherical assumption. Whether that matches reality depends entirely on whether the assumption fits.
"Running it once is enough." A single random start can land in a poor local optimum. n_init (multiple restarts) is there for a reason — keep it at 10 or more.
"Scaling is optional." For a distance-based method it is essential. Skipping it lets the largest-magnitude feature hijack the result.

Your turn

You are given X: 300 unlabeled points that form 3 blobs (the true labels are hidden from you, as in real clustering).

Standardize X with StandardScaler into X_scaled — distance-based methods need features on a comparable scale.
Fit KMeans with n_clusters=3, n_init=10, random_state=0 on X_scaled. Store the per-point cluster labels in labels (use fit_predict).
Store the number of distinct clusters found in n_clusters_found.
Store the size of each cluster (how many points it contains) as a NumPy array cluster_sizes, sorted ascending. Hint: np.unique(labels, return_counts=True) then sort the counts.

The hidden tests check that you scaled the data, found exactly 3 clusters, assigned all 300 points, and that the cluster sizes sum to 300.

Check your understanding

QuestionSelect one

In each iteration of k-means, what happens during the update step?

Each point is reassigned to its nearest center

Each center is moved to the mean (average position) of the points currently assigned to it

A new random center is added to split the largest cluster

The number of clusters k is recalculated from the data

QuestionSelect one

Why does KMeans use n_init=10 by recommendation rather than running just once?

It makes the clusters perfectly spherical

It is required to compute inertia

A single random start can settle into a poor local optimum, so k-means runs several times from different starts and keeps the lowest-inertia result

It automatically determines the correct number of clusters

QuestionSelect one

You compute inertia for k = 1, 2, ..., 10 and notice it keeps decreasing as k grows. Can you choose the best k by simply picking the one with the lowest inertia?

Yes — lowest inertia is always the best clustering

No — inertia always decreases as k increases (reaching zero when every point is its own cluster), so you look for the "elbow" where the decrease sharply slows instead

No — inertia is meaningless and should never be used

Yes — but only if you also standardize the features

QuestionSelect one

You run k-means on make_moons (two interleaving crescents) with k=2 and the result splits each crescent in half rather than separating the two moons. What is the best explanation?

The data needs more samples

You forgot to set random_state

K-means assigns points to the nearest center, so its clusters are effectively round regions — it cannot capture crescent shapes no matter how it is tuned

The inertia was too low

QuestionSelect one

You cluster customers using age (roughly 18–80) and annual_income (roughly 20,000–200,000) without scaling. What is the likely consequence?

Age will dominate because it has fewer digits

The clustering will be unaffected by feature scale

Income will dominate the distance calculations because its values are far larger, so the clusters form almost entirely along income and age is effectively ignored

K-means will raise an error about unscaled features

K-Means Clustering

On this page