Hierarchical Clustering#
Hierarchical clustering is an unsupervised machine learning algorithm used to group similar data points into clusters.
Unlike K-Means, you don’t need to specify the number of clusters upfront.
It builds a hierarchy of clusters, often visualized as a dendrogram.
There are two main types:
Agglomerative (Bottom-Up): start with each data point as its own cluster, then iteratively merge the two closest clusters until all points belong to one cluster.
Divisive (Top-Down): start with all points in one cluster, then recursively split clusters until each cluster contains a single point.
Agglomerative is far more commonly used in practice.
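As a quick illustration, the agglomerative variant can be run in a few lines with scikit-learn's `AgglomerativeClustering` (a sketch; the two toy blobs below are made-up data, not from the text):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated toy blobs (illustrative data)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Bottom-up merging; stop when 2 clusters remain
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # the first three points share one label, the last three the other
```

Note that `n_clusters` here only chooses where to stop merging; the hierarchy itself is built the same way regardless.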
Workflow of Agglomerative Clustering#
Compute Distance Matrix
Calculate pairwise distances between all points using a metric (Euclidean, Manhattan, Cosine, etc.).
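This step maps directly onto SciPy's `pdist`, which returns the condensed pairwise distance matrix (the three points below are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# Condensed distances: one value per pair, n*(n-1)/2 entries
d = pdist(X, metric="euclidean")
D = squareform(d)  # full symmetric n x n matrix with zero diagonal

print(D[0, 1])  # 5.0 (a 3-4-5 right triangle)
```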
Linkage Criteria
Decide how to measure distance between clusters:
Single linkage: distance between closest points in clusters
Complete linkage: distance between farthest points in clusters
Average linkage: average distance between points in clusters
Ward’s method: minimize variance within merged clusters
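These linkage criteria correspond to the `method` argument of `scipy.cluster.hierarchy.linkage`. A sketch on four made-up 1-D points shows how the choice changes the height of the final merge:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Four 1-D points forming two obvious pairs (illustrative data)
X = np.array([[0.0], [1.0], [5.0], [6.0]])

# Each row of Z records a merge: [cluster_i, cluster_j, distance, new_size]
final_merge = {}
for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)
    final_merge[method] = Z[-1, 2]  # distance of the last merge

print(final_merge)  # single (4.0) < average (5.0) < complete (6.0)
```

Single linkage uses the closest pair across the two pair-clusters (1 and 5, distance 4), complete uses the farthest (0 and 6, distance 6), and average sits in between.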
Merge Clusters Iteratively
Merge the two clusters with the smallest distance according to the chosen linkage.
Update the distance matrix after each merge.
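The merge loop above can be sketched naively in plain Python. This toy single-linkage implementation (illustrative data, recomputing cluster distances on each pass rather than maintaining an explicit matrix) makes the iteration visible:

```python
import numpy as np

X = np.array([[0.0], [1.0], [5.0], [6.0]])
clusters = [[i] for i in range(len(X))]  # start: one cluster per point

def single_link(a, b):
    # Cluster distance = distance between the closest pair of member points
    return min(abs(X[i, 0] - X[j, 0]) for i in a for j in b)

merges = []
while len(clusters) > 1:
    # Find the closest pair of clusters under the chosen linkage
    pairs = [(single_link(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    d, i, j = min(pairs)
    merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
    # Merge cluster j into cluster i, then drop j
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

print(merges[-1])  # final merge joins [0, 1] and [2, 3] at distance 4.0
```

A production implementation would update a distance matrix incrementally instead of recomputing, which is what library routines do.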
Build the Dendrogram
A dendrogram is a tree-like diagram showing cluster merges at different distances.
The height of each merge represents the distance between the clusters being merged.
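SciPy can compute the dendrogram structure without drawing it (`no_plot=True`), which is handy for inspecting merge heights programmatically (points below are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0], [1.0], [5.0], [6.0]])
Z = linkage(X, method="single")

# no_plot=True returns the tree layout as plain data instead of drawing it
tree = dendrogram(Z, no_plot=True)
print(tree["ivl"])  # leaf labels in left-to-right dendrogram order
print(max(max(seg) for seg in tree["dcoord"]))  # tallest merge height: 4.0
```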
Cut the Dendrogram to Form Clusters
Decide the number of clusters by choosing a height to cut the dendrogram.
The groups of points that remain connected below the cut become the final clusters.
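Cutting at a height corresponds to `scipy.cluster.hierarchy.fcluster` with `criterion="distance"` (the threshold 2.0 below is an arbitrary illustrative choice for the toy points):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0], [1.0], [5.0], [6.0]])
Z = linkage(X, method="single")

# Cut the tree at height 2.0: any merge above that distance is undone
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # points 0,1 land in one cluster and 5,6 in the other
```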
Distance Metrics and Linkage Methods#
| Metric | Description |
|---|---|
| Euclidean | Straight-line distance |
| Manhattan | Sum of absolute differences |
| Cosine | Angle between vectors |
| Correlation | 1 - correlation coefficient |
| Linkage | Description |
|---|---|
| Single | Minimum distance between points of clusters |
| Complete | Maximum distance between points of clusters |
| Average | Average distance between all points |
| Ward | Merge clusters to minimize within-cluster variance |
Ward linkage + Euclidean distance is commonly used because it tends to produce clusters of similar size and reduces variance.
Advantages of Hierarchical Clustering#
No need to pre-specify the number of clusters.
Produces a dendrogram, which is easy to interpret.
Can capture nested clusters (hierarchical relationships).
Limitations#
Computationally expensive for large datasets (O(n²) memory for the distance matrix and typically O(n³) time for the naive algorithm).
Sensitive to noise and outliers.
Choice of distance metric and linkage affects results.
Intuition#
Think of data points as leaves of a tree:
Agglomerative HC gradually merges leaves into branches → branches into bigger branches → one tree.
Cutting the tree at a certain height gives you clusters.