Dataslope

Evaluating Clusters

Clustering has no answer key, so how do you know if the groups are any good? The silhouette score and its limits — plus why cluster evaluation is part math, part judgment.

Supervised models have it easy: there is a true label, so you can measure how often you match it. Clustering has no such luxury. It is unsupervised — you handed the algorithm data with no labels and asked it to invent groups. So when it hands back clusters, the unsettling question is: good compared to what? This chapter is about the honest answers, the most useful metric (the silhouette score), and the discipline of not letting a single number decide for you.

Why this is genuinely hard

With no ground truth, "correct" clustering is not even well-defined. The same points can be grouped several reasonable ways depending on what you care about. Customers could cluster by spending, by age, by region — none is objectively the answer. So cluster evaluation splits into two situations:

  • No labels at all (the usual case). You judge clusters by their geometry: are points in the same cluster close together, and far from other clusters? These are called internal metrics. The silhouette score is the workhorse.
  • You secretly have labels (rare, e.g. a benchmark). You can compare the clusters to the known groups with external metrics like the adjusted Rand index. Useful for research, seldom available in real problems.

Internal vs external, in a sentence

Internal metrics ask "are these clusters geometrically tight and well-separated?" using only the data. External metrics ask "do these clusters match a known ground truth?" and need labels you usually do not have. The silhouette score is internal.

The silhouette score

The silhouette score captures a simple, intuitive idea: a point is well-clustered if it is close to its own cluster and far from the nearest other cluster. Formally, for each point i, let a(i) be the average distance from i to the other points in its own cluster, and let b(i) be the average distance from i to the points of the nearest other cluster (the cluster for which that average is smallest). Then:

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}

  • What it measures. Per point, how much better it fits its own cluster than the nearest alternative, scaled to lie between −1 and +1.
  • s(i) near +1 — the point is much closer to its own cluster than to any other: a confident, clean assignment.
  • s(i) near 0 — the point sits on the boundary between two clusters; it could plausibly belong to either.
  • s(i) negative — the point is, on average, closer to a different cluster than its own: it was probably assigned to the wrong group.
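
To make the definition concrete, here is a minimal sketch that computes a(i), b(i), and s(i) by hand and compares the result with scikit-learn's silhouette_samples. The six-point dataset and the helper name silhouette_of_point are hypothetical, chosen only for illustration, and the sketch assumes Euclidean distance and clusters with at least two points each.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Hypothetical toy data: two obvious groups on a number line.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
labels = np.array([0, 0, 0, 1, 1, 1])


def silhouette_of_point(X, labels, i):
    """Compute s(i) straight from the definitions of a(i) and b(i).

    Illustrative helper; it does not handle single-point clusters.
    """
    dists = np.linalg.norm(X - X[i], axis=1)  # Euclidean distance from point i to every point
    same = labels == labels[i]
    same[i] = False  # a(i) averages over the *other* points of i's cluster
    a = dists[same].mean()
    # b(i): the smallest mean distance to the points of any other cluster
    b = min(dists[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)


manual = np.array([silhouette_of_point(X, labels, i) for i in range(len(X))])
library = silhouette_samples(X, labels)  # scikit-learn's per-point values

for i, (m, lib) in enumerate(zip(manual, library)):
    print(f"point {i}: manual s(i) = {m:.4f}, silhouette_samples = {lib:.4f}")
```

For point 0, a(0) = (1 + 2) / 2 = 1.5 and b(0) = (10 + 11 + 12) / 3 = 11, so s(0) = 9.5 / 11 ≈ 0.86: a clean assignment, as the second bullet describes.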

The overall silhouette score is just the mean of s(i) across all points — one number summarizing how cleanly separated the whole clustering is.
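
A minimal sketch of computing the overall score with scikit-learn follows. The blob centers, sample count, and random_state are illustrative assumptions, not values from the original page; any well-separated synthetic data would do.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed synthetic data: three well-separated blobs (centers chosen for illustration).
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=1.0,
    random_state=0,
)

# Cluster without ever looking at the true blob labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The score needs only the data and the predicted labels.
score = silhouette_score(X, labels)
print(f"silhouette score: {score:.3f}")
```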

For clean, well-separated blobs the score is high, approaching +1 as the blobs get tighter and farther apart. Note that the metric used only X and the predicted labels; it never needed a true answer.

Using the silhouette score to choose k

The most practical use of the silhouette score is picking the number of clusters. Try several values of k, score each, and let the data vote.
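
One way to sketch that loop, assuming K-Means as the clustering algorithm and four synthetic blobs whose centers are picked purely for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed synthetic data: four well-separated blobs at the corners of a square.
X, _ = make_blobs(
    n_samples=400,
    centers=[(-6, -6), (-6, 6), (6, -6), (6, 6)],
    cluster_std=1.0,
    random_state=0,
)

scores = {}
for k in range(2, 8):  # the silhouette is undefined for k = 1
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")

# The k with the highest mean silhouette is the candidate to sanity-check.
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```

The loop starts at k = 2 because the silhouette needs at least two clusters to define b(i).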

On clean, well-separated data, the silhouette score peaks at the number of clusters the data actually has. Unlike the elbow method (which eyeballs a bend in the inertia curve), the silhouette gives a concrete number to compare, so the choice is less subjective.

The per-point silhouette plot

Averaging hides detail. The classic silhouette plot shows every point's s(i), grouped by cluster, so you can spot a cluster full of borderline or negative points even when the average looks fine.
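
A rough sketch of such a plot, built from scikit-learn's silhouette_samples and matplotlib. The data, the figure styling, and the output filename silhouette_plot.png are assumptions for illustration; in a notebook you would likely call plt.show() instead of saving.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Assumed synthetic data: three blobs with illustrative centers.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=1.5,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sample_values = silhouette_samples(X, labels)  # one s(i) per point
average = silhouette_score(X, labels)          # the mean of those values

fig, ax = plt.subplots(figsize=(6, 4))
y_lower = 0
for c in np.unique(labels):
    # Sort each cluster's values so its band forms the familiar "knife" shape.
    values = np.sort(sample_values[labels == c])
    y_upper = y_lower + len(values)
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, values)
    ax.text(-0.05, (y_lower + y_upper) / 2, str(c))
    y_lower = y_upper + 10  # leave a gap between clusters

ax.axvline(average, color="red", linestyle="--")  # average silhouette score
ax.set_xlabel("silhouette value s(i)")
ax.set_ylabel("points, grouped by cluster")
ax.set_yticks([])
fig.savefig("silhouette_plot.png", dpi=120)
```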

Wide, tall "knives" that mostly clear the red average line are healthy clusters. A cluster whose bars are short, or that dips below zero, is a warning that those points do not really belong together — something the single average score would have quietly absorbed.

The trap: a better score is not a "more correct" clustering

Here is the most important caution in this chapter. The silhouette score rewards compact, round, well-separated clusters. When the true structure is not shaped like that, the metric can confidently prefer the wrong answer.
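
The effect can be reproduced with a small sketch on two interleaving crescents. It assumes scikit-learn's make_moons as the data source and uses the generator's own labels as the "right" crescent grouping; the sample size, noise level, and seed are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two interleaving crescents; y_true marks which crescent each point came from.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# K-Means cuts the data into two compact, roughly round halves.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil_kmeans = silhouette_score(X, kmeans_labels)
sil_true = silhouette_score(X, y_true)

print(f"silhouette, K-Means split:    {sil_kmeans:.3f}")
print(f"silhouette, true crescents:   {sil_true:.3f}")

# How well the K-Means split actually recovers the crescents (1.0 would be perfect).
ari_kmeans = adjusted_rand_score(y_true, kmeans_labels)
print(f"adjusted Rand index, K-Means: {ari_kmeans:.3f}")
```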

The round clustering that is wrong for this data scores higher than the crescents that are right. If you trusted the silhouette score blindly, it would lead you straight to the worse clustering. The metric is measuring geometry, not truth.

A metric measures geometry, not meaning

A higher silhouette score means "rounder, more separated clusters," which is only the same as "better" when the real structure happens to be round and separated. For elongated, nested, or irregular structure, the silhouette can reward the wrong answer. Never let it overrule what you know about the problem.

What the silhouette score does not tell you

  • Whether the clusters are meaningful. It is pure geometry. Clusters can be tight and well-separated yet correspond to nothing you care about.
  • The right algorithm or shape. It implicitly favors convex, round clusters (like K-Means produces), so comparing a density-based clustering to K-Means by silhouette alone is unfair to the former.
  • How well any individual point fits. A high average can hide a cluster of borderline points, so always glance at the per-point plot.
  • Causation or business value. That a clean cluster exists says nothing about whether acting on it is profitable or wise.

When external metrics apply

If you happen to have true labels (a benchmark, a labeled sample), you can score how well clusters recover them with the adjusted Rand index, which is 1.0 for a perfect match and about 0.0 for random labeling — and, crucially, does not care what you name the clusters, only how points are grouped together.
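
As a sketch, assuming synthetic blobs whose generating labels stand in for the ground truth (the centers and seed are illustrative), the adjusted Rand index can be computed with scikit-learn's adjusted_rand_score. The second call renames the clusters to show that the IDs themselves do not matter.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Assumed benchmark-style data: here we keep the generating labels as y_true.
X, y_true = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=1.0,
    random_state=0,
)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

ari = adjusted_rand_score(y_true, labels)
print(f"adjusted Rand index: {ari:.3f}")

# Rename the clusters (0 -> 1, 1 -> 2, 2 -> 0): the grouping is unchanged,
# so the score is unchanged too.
renamed = (labels + 1) % 3
ari_renamed = adjusted_rand_score(y_true, renamed)
print(f"after renaming clusters: {ari_renamed:.3f}")
```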

A score near 1.0 means the discovered clusters line up almost perfectly with the true groups, regardless of how the cluster numbers were assigned. But remember: in real unsupervised work you usually do not have y_true — if you did, you might not be clustering at all.

Common misconceptions

  • "Higher silhouette always means a better clustering." Only when the true structure is round and separated. It can prefer wrong groupings on curved or nested data.
  • "The silhouette score tells me the right number of clusters." It suggests one, and often a good one — but on ambiguous data several k values can score similarly, and the metric cannot know your purpose.
  • "A good score means the clusters are meaningful." Geometry is not meaning. Validate clusters against domain knowledge and downstream use.
  • "Cluster numbers are meaningful labels." They are arbitrary; cluster "0" in one run can be cluster "2" in another. Compare groupings, not IDs.

Real-world applications

Marketing teams cluster customers and then check whether the segments are actionable, not just tight — a beautiful silhouette is useless if the segments cannot be targeted differently. Biologists cluster gene-expression profiles and validate against known pathways. In every case the metric is a flashlight, not a judge: it helps you find structure and compare options, but a human decides whether the structure means anything.

Your turn

Challenge
Pick k with the silhouette score

A dataset X is provided (it was generated with several blobs).

  1. For each k in range(2, 8), fit KMeans(n_clusters=k, n_init=10, random_state=0) and compute the silhouette_score of its labels.
  2. Collect the scores, in order, into a list called sil_scores.
  3. Set best_k to the k with the highest silhouette score.

The tests check that sil_scores has 6 entries, that every score is in the valid range from -1 to 1, and that best_k is the k with the highest silhouette score. (Worth noticing: the winning k here need not equal the 5 blobs the data was generated with — a live reminder that the silhouette optimizes geometry, not the "true" count.)

Check your understanding

Question (select one)

Why is evaluating a clustering fundamentally harder than evaluating a classifier?

Clustering algorithms are slower

Clustering is unsupervised — there is no ground-truth label to compare against, so "correct" is not even well-defined

Classifiers never make mistakes

Clustering cannot be measured at all

Question (select one)

What does a point's silhouette value close to +1 indicate?

The point is on the boundary between two clusters

The point is much closer to its own cluster than to the nearest other cluster — a clean, confident assignment

The point was assigned to the wrong cluster

The clustering used too many clusters

Question (select one)

A negative silhouette value for a point most likely means:

The point is a perfect cluster center

The point is, on average, closer to a different cluster than to its own — it was probably assigned to the wrong group

The silhouette score was computed incorrectly

The point has no neighbors

Question (select one)

On two interleaving crescent-shaped clusters, K-Means's round split scored a higher silhouette than the true crescents. What is the lesson?

The silhouette score was buggy

The crescents are not really clusters

The silhouette score rewards round, compact, separated clusters, so it can prefer a geometrically wrong grouping when the true structure is not round

K-Means is always the best clustering method

Question (select one)

You want to choose the number of clusters k. How is the silhouette score typically used for this?

Fit one k and trust it

Try several values of k, compute the silhouette score for each, and favor the k with the highest score (sanity-checked against domain sense)

Pick the k with the lowest score

The silhouette score cannot inform the choice of k

Question (select one)

Your clustering has a high average silhouette score. What should you still check before trusting it?

Nothing — a high average is conclusive

The per-point silhouette plot and whether the clusters are meaningful for your problem, since a high average can hide borderline clusters and says nothing about real-world value

That the cluster ID numbers are sorted

That every cluster has exactly the same size
