Evaluating Clusters
Clustering has no answer key, so how do you know if the groups are any good? The silhouette score and its limits — plus why cluster evaluation is part math, part judgment.
Supervised models have it easy: there is a true label, so you can measure how often you match it. Clustering has no such luxury. It is unsupervised — you handed the algorithm data with no labels and asked it to invent groups. So when it hands back clusters, the unsettling question is: good compared to what? This chapter is about the honest answers, the most useful metric (the silhouette score), and the discipline of not letting a single number decide for you.
Why this is genuinely hard
With no ground truth, "correct" clustering is not even well-defined. The same points can be grouped several reasonable ways depending on what you care about. Customers could cluster by spending, by age, by region — none is objectively the answer. So cluster evaluation splits into two situations:
- No labels at all (the usual case). You judge clusters by their geometry: are points in the same cluster close together, and far from other clusters? These are called internal metrics. The silhouette score is the workhorse.
- You secretly have labels (rare, e.g. a benchmark). You can compare the clusters to the known groups with external metrics like the adjusted Rand index. Useful for research, seldom available in real problems.
Internal vs external, in a sentence
Internal metrics ask "are these clusters geometrically tight and well-separated?" using only the data. External metrics ask "do these clusters match a known ground truth?" and need labels you usually do not have. The silhouette score is internal.
The silhouette score
The silhouette score captures a simple, intuitive idea: a point is
well-clustered if it is close to its own cluster and far from the
nearest other cluster. Formally, for each point i, let a(i) be the average
distance from i to the other points in its own cluster, and b(i) the
average distance from i to the points of the nearest other cluster. Then:
$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$
- What it measures. Per point, how much better it fits its own cluster than the nearest alternative, scaled to lie between −1 and +1.
- s(i) near +1 — the point is much closer to its own cluster than to any other: a confident, clean assignment.
- s(i) near 0 — the point sits on the boundary between two clusters; it could plausibly belong to either.
- s(i) negative — the point is, on average, closer to a different cluster than its own: it was probably assigned to the wrong group.
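To make a(i) and b(i) concrete, here is a minimal sketch that computes s(i) by hand on a tiny made-up dataset and compares it with scikit-learn's silhouette_samples. The data, the labels, and the helper name silhouette_by_hand are illustrative assumptions, and the helper assumes every cluster has at least two points.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: five points on a line, split into two clusters.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
labels = np.array([0, 0, 0, 1, 1])


def silhouette_by_hand(X, labels, i):
    """Return s(i) = (b - a) / max(a, b) for point i (Euclidean distance)."""
    dists = np.linalg.norm(X - X[i], axis=1)

    # a(i): mean distance to the OTHER points in the same cluster.
    own = labels == labels[i]
    own[i] = False
    a = dists[own].mean()

    # b(i): smallest mean distance to the points of any other cluster.
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])

    return (b - a) / max(a, b)


s_manual = np.array([silhouette_by_hand(X, labels, i) for i in range(len(X))])
s_sklearn = silhouette_samples(X, labels)

print("by hand:", np.round(s_manual, 3))
print("sklearn:", np.round(s_sklearn, 3))
```

For the first point, a = (1 + 2) / 2 = 1.5 and b = (10 + 11) / 2 = 10.5, so s = 9 / 10.5, which is about 0.86: a clean assignment.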
The overall silhouette score is just the mean of s(i) across all
points — one number summarizing how cleanly separated the whole clustering
is.
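A short sketch of computing the overall score with scikit-learn follows. The blob centers, spread, and sample count are illustrative assumptions chosen to give clearly separated groups; they are not values from this chapter.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Illustrative data: three well-separated blobs. The generated labels are discarded.
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)

# Cluster, then score using only the data and the predicted labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)

# The overall score is the mean of the per-point values.
per_point = silhouette_samples(X, labels)

print(f"silhouette score: {score:.3f}")
print(f"mean of s(i):     {per_point.mean():.3f}")
```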
For clean, well-separated blobs the score is high, approaching +1 as the
clusters get tighter and farther apart. Note that the metric uses only X
and the predicted labels; it never needs a true answer.
Using the silhouette score to choose k
The most practical use of the silhouette score is picking the number of
clusters. Try several values of k, score each, and let the data vote.
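One way to sketch that loop, assuming scikit-learn and an illustrative dataset of four well-separated blobs (the centers and spread are made up for this example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data: four blobs at the corners of a square.
X, _ = make_blobs(
    n_samples=400,
    centers=[[-6, -6], [-6, 6], [6, -6], [6, 6]],
    cluster_std=1.0,
    random_state=0,
)

# Score each candidate k. Silhouette needs at least 2 clusters, so start at 2.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

# Favor the k with the highest score, then sanity-check it against domain sense.
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```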
On data with clean, well-separated groups, the silhouette score tends to peak at the number of clusters the data actually has. Unlike the elbow method (which eyeballs a bend in the inertia curve), the silhouette gives a concrete number to compare, so the choice is less subjective.
The per-point silhouette plot
Averaging hides detail. The classic silhouette plot shows every point's
s(i), grouped by cluster, so you can spot a cluster full of borderline or
negative points even when the average looks fine.
Wide, tall "knives" that mostly clear the red average line are healthy clusters. A cluster whose bars are short, or that dips below zero, is a warning that those points do not really belong together — something the single average score would have quietly absorbed.
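The same per-point information can be summarized numerically, without drawing anything. The sketch below uses silhouette_samples to report, for each cluster, its size, mean s(i), minimum s(i), and the share of points below zero. The dataset (two crowded blobs plus one distant blob) is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

# Illustrative data: two blobs close together and one far away.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [2.5, 0], [10, 10]],
    cluster_std=1.0,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One silhouette value per point, in the same order as X.
s = silhouette_samples(X, labels)

# Per-cluster breakdown: the numbers behind each "knife" in the plot.
summary = {}
for c in np.unique(labels):
    s_c = s[labels == c]
    summary[int(c)] = {
        "size": int(s_c.size),
        "mean": float(s_c.mean()),
        "min": float(s_c.min()),
        "share_below_zero": float((s_c < 0).mean()),
    }

print(f"overall mean: {s.mean():.3f}")
for c, row in summary.items():
    print(
        f"cluster {c}: size={row['size']}, mean={row['mean']:.3f}, "
        f"min={row['min']:.3f}, below zero={row['share_below_zero']:.1%}"
    )
```

A cluster whose mean sits well under the overall mean, or whose share below zero is noticeable, is the numeric counterpart of a short or dipping knife.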
The trap: a better score is not a "more correct" clustering
Here is the most important caution in this chapter. The silhouette score rewards compact, round, well-separated clusters. When the true structure is not shaped like that, the metric can confidently prefer the wrong answer.
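A minimal sketch of this effect, assuming scikit-learn's make_moons as the crescent-shaped data (the sample count and noise level are illustrative choices). It scores two groupings of the same points: the K-Means split and the labels the crescents were generated with.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

# Two interleaving crescents. y_moons records which crescent each point came from.
X, y_moons = make_moons(n_samples=400, noise=0.05, random_state=0)

# K-Means cuts the data into two compact halves, slicing across both crescents.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Score both groupings with the same metric.
sil_kmeans = silhouette_score(X, kmeans_labels)
sil_crescents = silhouette_score(X, y_moons)

print(f"K-Means split:  {sil_kmeans:.3f}")
print(f"true crescents: {sil_crescents:.3f}")
```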
The round clustering that is wrong for this data scores higher than the crescents that are right. If you trusted the silhouette score blindly, it would lead you straight to the worse clustering. The metric is measuring geometry, not truth.
A metric measures geometry, not meaning
A higher silhouette score means "rounder, more separated clusters," which is only the same as "better" when the real structure happens to be round and separated. For elongated, nested, or irregular structure, the silhouette can reward the wrong answer. Never let it overrule what you know about the problem.
What the silhouette score does not tell you
- Whether the clusters are meaningful. It is pure geometry. Clusters can be tight and well-separated yet correspond to nothing you care about.
- The right algorithm or shape. It implicitly favors convex, round clusters (like K-Means produces), so comparing a density-based clustering to K-Means by silhouette alone is unfair to the former.
- Which individual points are poorly placed. A high average can hide a cluster of borderline points, so always glance at the per-point plot.
- Causation or business value. That a clean cluster exists says nothing about whether acting on it is profitable or wise.
When external metrics apply
If you happen to have true labels (a benchmark, a labeled sample), you can score how well clusters recover them with the adjusted Rand index, which is 1.0 for a perfect match and about 0.0 for random labeling — and, crucially, does not care what you name the clusters, only how points are grouped together.
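A small sketch with scikit-learn's adjusted_rand_score shows both properties. The label lists are made up for illustration: y_renamed is the same grouping with the cluster numbers shuffled, and y_one_wrong moves a single point into a different group.

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical ground truth for nine points in three groups.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping, different cluster numbers.
y_renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# One point (the third) placed in the wrong group.
y_one_wrong = [0, 0, 1, 1, 1, 1, 2, 2, 2]

ari_renamed = adjusted_rand_score(y_true, y_renamed)
ari_one_wrong = adjusted_rand_score(y_true, y_one_wrong)

print(f"renamed clusters: {ari_renamed:.3f}")
print(f"one point wrong:  {ari_one_wrong:.3f}")
```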
A score near 1.0 means the discovered clusters line up almost perfectly with
the true groups, regardless of how the cluster numbers were assigned. But
remember: in real unsupervised work you usually do not have y_true — if
you did, you might not be clustering at all.
Common misconceptions
- "Higher silhouette always means a better clustering." Only when the true structure is round and separated. It can prefer wrong groupings on curved or nested data.
- "The silhouette score tells me the right number of clusters." It suggests one, and often a good one, but on ambiguous data several k values can score similarly, and the metric cannot know your purpose.
- "A good score means the clusters are meaningful." Geometry is not meaning. Validate clusters against domain knowledge and downstream use.
- "Cluster numbers are meaningful labels." They are arbitrary; cluster "0" in one run can be cluster "2" in another. Compare groupings, not IDs.
Real-world applications
Marketing teams cluster customers and then check whether the segments are actionable, not just tight — a beautiful silhouette is useless if the segments cannot be targeted differently. Biologists cluster gene-expression profiles and validate against known pathways. In every case the metric is a flashlight, not a judge: it helps you find structure and compare options, but a human decides whether the structure means anything.
Your turn
A dataset X is provided (it was generated with several blobs).
- For each k in range(2, 8), fit KMeans(n_clusters=k, n_init=10, random_state=0) and compute the silhouette_score of its labels.
- Collect the scores, in order, into a list called sil_scores.
- Set best_k to the k with the highest silhouette score.
The tests check that sil_scores has 6 entries, that every score is in the
valid range from -1 to 1, and that best_k is the k with the highest
silhouette score. (Worth noticing: the winning k here need not equal the 5
blobs the data was generated with — a live reminder that the silhouette
optimizes geometry, not the "true" count.)
Check your understanding
Why is evaluating a clustering fundamentally harder than evaluating a classifier?
Clustering algorithms are slower
Clustering is unsupervised — there is no ground-truth label to compare against, so "correct" is not even well-defined
Classifiers never make mistakes
Clustering cannot be measured at all
What does a point's silhouette value close to +1 indicate?
The point is on the boundary between two clusters
The point is much closer to its own cluster than to the nearest other cluster — a clean, confident assignment
The point was assigned to the wrong cluster
The clustering used too many clusters
A negative silhouette value for a point most likely means:
The point is a perfect cluster center
The point is, on average, closer to a different cluster than to its own — it was probably assigned to the wrong group
The silhouette score was computed incorrectly
The point has no neighbors
On two interleaving crescent-shaped clusters, K-Means's round split scored a higher silhouette than the true crescents. What is the lesson?
The silhouette score was buggy
The crescents are not really clusters
The silhouette score rewards round, compact, separated clusters, so it can prefer a geometrically wrong grouping when the true structure is not round
K-Means is always the best clustering method
You want to choose the number of clusters k. How is the silhouette score typically used for this?
Fit one k and trust it
Try several values of k, compute the silhouette score for each, and favor the k with the highest score (sanity-checked against domain sense)
Pick the k with the lowest score
The silhouette score cannot inform the choice of k
Your clustering has a high average silhouette score. What should you still check before trusting it?
Nothing — a high average is conclusive
The per-point silhouette plot and whether the clusters are meaningful for your problem, since a high average can hide borderline clusters and says nothing about real-world value
That the cluster ID numbers are sorted
That every cluster has exactly the same size