Hierarchical Clustering

Building a whole family tree of clusters from the bottom up — how agglomerative merging works, how to read a dendrogram, and why "you do not have to commit to k upfront" is both its superpower and its cost.

K-means asked you to make an awkward promise before you had even seen the results: how many clusters do you want? You had to commit to k, then hope the elbow agreed. Hierarchical clustering refuses to make you choose so early. Instead of producing one flat partition, it builds an entire tree of clusters — from "every point is its own cluster" all the way up to "everything is one big cluster" — and lets you decide afterward where to slice it.

That tree, drawn as a diagram called a dendrogram, is the real prize. It does not just tell you the groups; it shows you which groups are tight, which are loose, which two clusters are most similar, and how the whole dataset nests together at every level of granularity. For many problems that picture is worth more than any single set of labels.

This page assumes you have met clustering on the K-Means Clustering page. We will lean on the same instincts — distance, "similar things belong together," standardize before you measure distance — and add a new one: structure.

Same family, different philosophy

K-means and hierarchical clustering are both unsupervised and both distance-based, but they think differently. K-means starts with k centers and shuffles points around them. Hierarchical clustering never places centers at all — it grows a tree by repeatedly fusing the two closest clusters. Neither is "better"; they answer slightly different questions and fail in different ways.

The problem it solves

Real groupings are often nested, not flat. Think about how you would organize animals: at a coarse level, mammals vs birds vs fish. Zoom in and mammals split into primates, carnivores, rodents. Zoom further and carnivores split into cats and dogs. There is no single "correct" number of groups — it depends on how closely you want to look. A flat method like k-means forces you to pick one zoom level. Hierarchical clustering captures all of them at once.

So the question hierarchical clustering answers is broader than k-means's: not just "what are the k groups?" but "how do these points relate to one another at every level of similarity, from individuals up to the whole dataset?"

Where you see this in the wild:

Taxonomy and phylogenetics. Building evolutionary trees from genetic distances is hierarchical clustering by another name.
Customer or product hierarchies. Coarse segments that split into finer sub-segments, all from one tree.
Document organization. Topics that nest into subtopics.
Exploratory analysis on small data. When you have a few dozen items and want to see their relationships before committing to anything, a dendrogram is often the single most informative plot you can make.

The intuition: start apart, merge the closest, repeat

The most common form is agglomerative — a fancy word for "bottom-up, by merging." (There is also a top-down "divisive" variant, but agglomerative is what you will use and what scikit-learn provides.) The recipe is almost embarrassingly simple:

Start with every point as its own cluster. With 20 points, you begin with 20 clusters of size one.
Find the two closest clusters and merge them into a single cluster.
Repeat: again find the two closest clusters (now including the merged one) and fuse them.
Stop when everything has been merged into one cluster at the top.

Each merge reduces the cluster count by exactly one, so after n - 1 merges you have gone from n singletons to one giant cluster. Along the way you recorded every intermediate grouping. That recorded sequence of merges is the tree.

Read that diagram bottom-up, the way the algorithm builds it. First a and b are closest, so they merge. Separately c and d merge. Then e joins the {c, d} cluster because it is closer to them than to {a, b}. Finally the two remaining clusters fuse at the top. Every fork is a merge, and the height at which two branches join tells you how far apart those clusters were when they merged — the key to reading the real thing.

The big idea: height encodes distance

In a dendrogram, the vertical position of each merge is the distance between the two clusters that fused there. Low merges joined very similar things; high merges joined things that were quite different. So the tree is not just a grouping — it is a map of similarity, with the y-axis as the ruler.

Clustering blobs with `AgglomerativeClustering`

If you just want flat labels — "put these points into k groups" — scikit-learn's AgglomerativeClustering builds the tree internally and cuts it for you at the k you ask for. The interface mirrors k-means, but notice there is no n_init: the algorithm is deterministic, with no random starts to retry.

On clean blobs this gives essentially the same answer k-means would — four sensible groups. The difference is what is happening underneath: there are no centroids and no iteration toward them. The model fused points and sub-clusters until only four remained, then handed you the result of cutting the tree at that level.

No random_state, no n_init — and why

Agglomerative clustering is deterministic: given the same data and the same linkage rule, it always produces the same tree. There is no random initialization to get unlucky with, so there is nothing to restart. That is one genuine advantage over k-means — you never wonder whether a different seed would have changed your clusters.

Seeing the tree: a dendrogram on a small sample

The flat labels are useful, but they hide the structure that makes hierarchical clustering special. To see the tree we draw a dendrogram, and for that we use SciPy, whose linkage and dendrogram functions are the standard tools.

A dendrogram of 300 points is an unreadable smear, so we plot a small sample — about 18 points — where every leaf and merge is legible. This is not a limitation of the method; it is just good visualization. You read the shape on a small or summarized sample, then apply the method at full scale for the labels.

Spend a moment reading it, because this picture is the whole point:

The leaves at the bottom are the 18 individual points, each starting as its own cluster.
Each upside-down U is a merge. The two branches under it are the clusters that fused; the height of the U is how far apart they were.
Short merges near the bottom join very similar points — these are your tight, confident sub-clusters.
The tall merges near the top join groups that are quite different. See those one or two big vertical jumps? They are where the three real blobs finally get stitched together, and the large gap before them is the visual signature of "there are about three real clusters here."

Choose k by where the tree has the biggest gap

To pick the number of clusters from a dendrogram, imagine a horizontal line sliding down from the top. Cut where you cross the tallest vertical stretch with no merges — the biggest jump in merge distance. Cutting there separates groups that were very far apart while keeping together groups that were close. On the plot above, that long gap before the final merges points squarely at three clusters.

Cutting the tree at any level you like

The dendrogram makes hierarchical clustering's superpower concrete: one tree gives you every k at once. You do not refit anything to change the number of clusters — you just slide the cut line. SciPy's fcluster does exactly this.

The same tree yields 2, 3, 4, or 5 clusters depending only on where you slice. With k-means you would have had to rerun the algorithm from scratch for each k. Here the expensive part — building the tree — happens once, and exploring different granularities is free.

Linkage: what "distance between two clusters" even means

There is a subtlety we glossed over. "Merge the two closest clusters" is clear when clusters are single points. But once a cluster contains many points, what is the distance between two clusters? There are several reasonable definitions, and the choice — called the linkage — changes the shape of the tree. The three you will meet most:

Ward (the default and usually the safe first choice). Merges the two clusters that increase the total within-cluster variance the least. It favors compact, roughly equal-sized, spherical clusters — much like k-means in spirit. Use it as your default for numeric data.
Complete (maximum) linkage. Distance between clusters is the distance between their two farthest points. This is cautious about merging — a cluster only joins another if all its members are reasonably close — so it produces compact clusters but can break up clusters that have any spread.
Average linkage. Distance is the average over all pairs of points, one from each cluster. A middle ground between complete and the more permissive single linkage.

(There is a fourth, single linkage — nearest pair of points — which can chain clusters together through stepping-stone points. It is occasionally useful for elongated shapes but notoriously prone to "chaining" stray points into one giant cluster, so reach for it deliberately, not by default.)

On tidy, well-separated blobs all three linkages agree. The differences surface on messier data: complete linkage insists on tight clusters, average sits in between, and ward keeps things balanced. When a clustering looks odd, switching the linkage is one of the first knobs to try.

Linkage choice changes your answer

There is no universally correct linkage — it encodes an assumption about what a cluster should look like. Ward and complete both prefer compact clusters; single linkage can string distant points into one long chain. Picking a linkage is a modeling decision, not a default to ignore. When in doubt, start with ward for numeric data, look at the dendrogram, and try an alternative if the structure seems wrong.

Building a full dendrogram from the scikit-learn model

You often want both worlds: scikit-learn's flat labels and the dendrogram. scikit-learn can compute the full tree if you ask it to (distance_threshold=0, n_clusters=None, compute_distances=True), after which the merge information lives in model.children_ and model.distances_. SciPy's dendrogram can draw that directly once we pack it into the linkage-matrix format SciPy expects. This helper is worth keeping around.

Now you can get scikit-learn's labels for any k and visualize the same tree they came from — the best of both libraries.

Hierarchical vs k-means: a direct comparison

These two methods are the classic pairing in introductory clustering, and knowing when to pick which is a practical skill.

	K-means	Hierarchical (agglomerative)
Must choose `k` upfront?	Yes	No — cut the tree afterward
Randomness	Yes (needs `n_init`)	No — deterministic
Speed / scale	Fast, scales to millions	Slow, struggles past a few thousand points
Output	Flat labels + centroids	A full tree (dendrogram) you can cut anywhere
Cluster shape	Round / spherical only	Depends on linkage; still distance-based
Best for	Large data, known-ish `k`	Exploring structure, small data, nested groups

Hierarchical clustering does not scale

The honest cost of all that structure is speed and memory. Naive agglomerative clustering compares every pair of points, which grows roughly with the square (or worse) of the number of points. A few thousand points is comfortable; tens of thousands is painful; hundreds of thousands is usually out of the question. When your data is large, k-means (or its mini-batch variant) is the practical choice, and you can still draw a dendrogram on a representative sample to understand the structure.

When to use hierarchical clustering — and when not to

Reach for it when:

You do not know k and want the data to suggest it — slide the cut down the tree to the biggest gap.
The structure is plausibly nested (taxonomies, sub-segments, topic/subtopic hierarchies).
The dataset is small to medium and you value a clear visual map of relationships over raw speed.
You want a deterministic result you can reproduce exactly, with no seed to worry about.

Avoid it when:

The dataset is large — it will be slow and memory-hungry. Use k-means and dendrogram a sample instead.
You need to assign brand-new points to clusters after fitting. K-means can do this instantly (find the nearest centroid); plain agglomerative clustering has no centroids and no natural "predict for a new point," so you would refit.
The clusters are highly non-convex and you are using ward or complete linkage — those still prefer compact shapes, so they share k-means's blind spot for crescents and rings. (Single linkage or density methods like DBSCAN handle those better.)

A note on judging the result

Hierarchical clustering, like k-means, gives you groups without telling you whether they are good. The dendrogram's gaps are a strong visual hint, but to put a number on cluster quality — how separated and cohesive the clusters are — you use metrics like the silhouette score. We keep that deep dive on the Evaluating Clusters page rather than repeating it here; the same metrics apply no matter which clustering algorithm produced the labels.

Common misconceptions

"Hierarchical clustering avoids choosing k." It avoids choosing k before fitting. To get actual flat labels you still pick a cut level — the tree just lets you do it afterward, with the dendrogram as a guide.
"The dendrogram's left-to-right order means something." The horizontal ordering of leaves is partly cosmetic (chosen to avoid crossing branches). Only the vertical structure — what merges with what, and at what height — carries meaning.
"It is always better than k-means because it shows more." It shows more but costs far more. On large data, hierarchical clustering is simply impractical, and k-means wins on speed and the ability to label new points.
"Linkage is a minor technical detail." Linkage encodes what you think a cluster is. Changing it from ward to single can completely reshape the result. It is a real modeling choice.
"You can skip standardizing because there are no centroids." No — it is still a distance-based method. Unscaled features let the largest-magnitude feature dominate every distance, exactly as in k-means. Standardize first.

Your turn

You are given X: 150 unlabeled points forming 3 blobs (the true labels are hidden, as in real clustering).

Standardize X with StandardScaler into X_scaled — hierarchical clustering is distance-based, so features must be on a comparable scale.
Fit AgglomerativeClustering with n_clusters=3 and linkage="ward" on X_scaled. Store the per-point labels in labels (use fit_predict).
Store the number of distinct clusters found in n_found.
Store the size of the LARGEST cluster (its point count) in largest_cluster. Hint: np.bincount(labels) gives the count per label; take its max().

The hidden tests check that you standardized the data, found exactly 3 clusters, labeled all 150 points, and that the largest cluster size is consistent with the data.

Check your understanding

QuestionSelect one

In agglomerative hierarchical clustering, how does the process begin?

With k randomly placed cluster centers

With every point already assigned to one of k groups

With every individual point as its own separate cluster, which are then merged two at a time

With the two farthest points forming the first cluster

QuestionSelect one

In a dendrogram, what does the height at which two branches merge represent?

The number of points in the resulting cluster

The distance between the two clusters at the moment they were merged — higher means they were more different

The order in which the points were loaded

The variance of a single point

QuestionSelect one

What is a genuine advantage of hierarchical clustering over k-means?

It always runs faster on very large datasets

It guarantees spherical clusters

You do not have to commit to the number of clusters before fitting — one tree can be cut at any level to yield different numbers of clusters

It can label brand-new points instantly without refitting

QuestionSelect one

Your dataset has roughly 500,000 points. Why is plain agglomerative hierarchical clustering usually a poor choice here?

It cannot handle numeric features

It requires labels to run

Its time and memory cost grows roughly with the square of the number of points, so hundreds of thousands of points become impractically slow and memory-hungry

Dendrograms can only display exactly 10 points

QuestionSelect one

Before running hierarchical clustering on features measured in different units (e.g. age in years and income in dollars), why should you standardize them first?

Standardizing is only needed for k-means, not hierarchical clustering

It converts the labels into numbers

Hierarchical clustering merges clusters based on distance, and without standardizing, the feature with the largest numeric range dominates every distance — so you scale to give each feature an equal say

Standardizing guarantees the clusters will be perfectly spherical

Hierarchical Clustering

On this page