Hierarchical Clustering#
Hierarchical clustering is an unsupervised machine learning algorithm used to group similar data points into clusters.
Unlike K-Means, you don’t need to specify the number of clusters upfront.
It builds a hierarchy of clusters, often visualized as a dendrogram.
There are two main types:
Agglomerative (Bottom-Up): start with each data point as its own cluster, then iteratively merge the two closest clusters until all points belong to one cluster.
Divisive (Top-Down): start with all points in one cluster, then recursively split clusters until each cluster contains a single point.
Agglomerative is far more commonly used in practice.
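As a quick illustration, the agglomerative variant can be run in a few lines with scikit-learn's `AgglomerativeClustering` (a sketch; the two toy blobs below are made-up data, not from the text):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated toy blobs (illustrative data)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Bottom-up merging; stop when 2 clusters remain
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # the first three points share one label, the last three the other
```

Note that `n_clusters` here only chooses where to stop merging; the hierarchy itself is built the same way regardless.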
Workflow of Agglomerative Clustering#
Compute Distance Matrix
Calculate pairwise distances between all points using a metric (Euclidean, Manhattan, Cosine, etc.).
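This step maps directly onto SciPy's `pdist`, which returns the condensed pairwise distance matrix (the three points below are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed distances: one value per pair, n*(n-1)/2 entries
d = pdist(X, metric="euclidean")
D = squareform(d)  # full symmetric n x n matrix with zero diagonal

print(D[0, 1])  # 5.0 (a 3-4-5 right triangle)
```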
Linkage Criteria
Decide how to measure distance between clusters:
Single linkage: distance between closest points in clusters
Complete linkage: distance between farthest points in clusters
Average linkage: average distance between points in clusters
Ward’s method: minimize variance within merged clusters
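These linkage criteria correspond to the `method` argument of `scipy.cluster.hierarchy.linkage`. A sketch on four made-up 1-D points shows how the choice changes the height of the final merge:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four 1-D points forming two obvious pairs (illustrative data)
X = np.array([[0.0], [1.0], [5.0], [6.0]])

# Each row of Z records a merge: [cluster_i, cluster_j, distance, new_size]
final_merge = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    final_merge[method] = Z[-1, 2]  # distance of the last merge

print(final_merge)  # single (4.0) < average (5.0) < complete (6.0)
```

Single linkage uses the closest pair across the two pair-clusters (1 and 5, distance 4), complete uses the farthest (0 and 6, distance 6), and average sits in between.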
Merge Clusters Iteratively
Merge the two clusters with the smallest distance according to the chosen linkage.
Update the distance matrix after each merge.
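The merge loop above can be sketched naively in plain Python. This toy single-linkage implementation (illustrative data, recomputing cluster distances on each pass rather than maintaining an explicit matrix) makes the iteration visible:

```python
import numpy as np

X = np.array([[0.0], [1.0], [5.0], [6.0]])
clusters = [[i] for i in range(len(X))]  # start: one cluster per point

def single_link(a, b):
    # Cluster distance = distance between the closest pair of member points
    return min(abs(X[i, 0] - X[j, 0]) for i in a for j in b)

merges = []
while len(clusters) > 1:
    # Find the closest pair of clusters under the chosen linkage
    pairs = [(single_link(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    d, i, j = min(pairs)
    merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
    # Merge cluster j into cluster i, then drop j
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(merges[-1])  # final merge joins [0, 1] and [2, 3] at distance 4.0
```

A production implementation would update a distance matrix incrementally instead of recomputing, which is what library routines do.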
Build the Dendrogram
A dendrogram is a tree-like diagram showing cluster merges at different distances.
The height of each merge represents the distance between the clusters being merged.
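SciPy can compute the dendrogram structure without drawing it (`no_plot=True`), which is handy for inspecting merge heights programmatically (points below are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0], [1.0], [5.0], [6.0]])
Z = linkage(X, method="single")

# no_plot=True returns the tree layout as plain data instead of drawing it
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right dendrogram order
print(max(max(seg) for seg in tree["dcoord"]))  # tallest merge height: 4.0
```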
Cut the Dendrogram to Form Clusters
Decide the number of clusters by choosing a height to cut the dendrogram.
The groups of points that remain connected below the cut become the final clusters.
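Cutting at a height corresponds to `scipy.cluster.hierarchy.fcluster` with `criterion="distance"` (the threshold 2.0 below is an arbitrary illustrative choice for the toy points):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0], [1.0], [5.0], [6.0]])
Z = linkage(X, method="single")

# Cut the tree at height 2.0: any merge above that distance is undone
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # points 0,1 land in one cluster and 5,6 in the other
```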
Distance Metrics and Linkage Methods#
| Metric | Description |
|---|---|
| Euclidean | Straight-line distance |
| Manhattan | Sum of absolute differences |
| Cosine | Angle between vectors |
| Correlation | 1 - correlation coefficient |
| Linkage | Description |
|---|---|
| Single | Minimum distance between points of clusters |
| Complete | Maximum distance between points of clusters |
| Average | Average distance between all points |
| Ward | Merge clusters to minimize within-cluster variance |
Ward linkage + Euclidean distance is commonly used because it tends to produce clusters of similar size and reduces variance.
Advantages of Hierarchical Clustering#
No need to pre-specify the number of clusters.
Produces a dendrogram, which is easy to interpret.
Can capture nested clusters (hierarchical relationships).
Limitations#
Computationally expensive for large datasets (O(n²) memory for the distance matrix and typically O(n³) time for the naive algorithm).
Sensitive to noise and outliers.
Choice of distance metric and linkage affects results.
Intuition#
Think of data points as leaves of a tree:
Agglomerative HC gradually merges leaves into branches → branches into bigger branches → one tree.
Cutting the tree at a certain height gives you clusters.