# K-Means Clustering
K-Means is one of the most popular unsupervised learning algorithms for clustering. It partitions a dataset into K clusters (groups).

- Each cluster has a centroid (the “center” of the cluster).
- Each data point is assigned to the cluster whose centroid is closest, usually measured by Euclidean distance.

It’s like grouping students into K study groups based on how close their marks are in math and science.
## The K-Means Algorithm (Step-by-Step)
1. Choose K → the number of clusters you want.
2. Initialize centroids → randomly select K points as the initial centroids.
3. Assign points to the nearest centroid → each data point joins the cluster with the closest centroid (using Euclidean distance).
4. Update centroids → recalculate each centroid as the mean of the points in its cluster.
5. Repeat steps 3–4 until:
   - the centroids stop changing significantly (convergence), or
   - the maximum number of iterations is reached.
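The steps above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation (the function name and defaults are our own choices, and scikit-learn's `KMeans` should be preferred in practice):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-Means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct random points from X.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop when the centroids no longer change (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

Running it on two well-separated groups of points should recover one label per group.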
## Mathematical Intuition
The objective function minimized by K-Means is the Within-Cluster Sum of Squares (WCSS):

$$
\text{WCSS} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
$$

Where:

- \(K\) = number of clusters
- \(C_i\) = set of points in cluster \(i\)
- \(\mu_i\) = centroid of cluster \(i\)

👉 Goal: minimize the total squared distance of all points from their cluster centroids.
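The WCSS objective translates almost line-for-line into code. A small sketch (the function name `wcss` is our own):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-Cluster Sum of Squares: total squared Euclidean distance
    from each point to the centroid of its assigned cluster."""
    return sum(
        np.sum((X[labels == i] - mu) ** 2)   # inner sum over points in C_i
        for i, mu in enumerate(centroids)    # outer sum over clusters
    )
```

For example, two points at \((0, 0)\) and \((2, 0)\) in one cluster with centroid \((1, 0)\) each contribute a squared distance of 1, giving a WCSS of 2.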
## Choosing the Right K (Number of Clusters)
- Elbow Method → plot WCSS vs. K and look for the “elbow” point where adding more clusters stops helping much.
- Silhouette Score → measures how similar a point is to its own cluster vs. other clusters.
- Gap Statistic → compares clustering performance against a random reference distribution.
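A quick way to try the first two methods is with scikit-learn (assuming it is installed; the synthetic three-blob data here is purely illustrative). scikit-learn exposes WCSS as the fitted model's `inertia_` attribute:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # km.inertia_ is scikit-learn's name for WCSS.
    print(f"K={k}  WCSS={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
```

On data like this, WCSS drops sharply up to K=3 and flattens afterwards (the “elbow”), and the silhouette score peaks around the true number of clusters.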
## Advantages
- ✅ Simple and easy to implement.
- ✅ Works well with large datasets.
- ✅ Converges quickly in practice.
## Limitations
- ❌ K must be chosen in advance.
- ❌ Sensitive to initialization (bad initial centroids → poor results).
- ❌ Struggles with non-spherical clusters or clusters of different densities.
- ❌ Sensitive to outliers, since the mean is easily pulled by extreme values.
## Example Use Cases
- Customer segmentation in marketing.
- Document or image clustering.
- Anomaly detection.
- Grouping genes with similar expression patterns in biology.
## Quick Intuition Example

Imagine you have data on people’s height and weight. If you run K-Means with \(K=3\):

- Cluster 1 might group short & light people.
- Cluster 2 might group medium-height & medium-weight people.
- Cluster 3 might group tall & heavy people.

The algorithm finds these groups automatically, without labels.
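This intuition can be checked with scikit-learn on made-up height/weight data (the three group means and spreads below are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (height cm, weight kg) samples for three loose groups of people.
rng = np.random.default_rng(42)
short_light = rng.normal([155, 50], 3, size=(20, 2))
medium      = rng.normal([170, 70], 3, size=(20, 2))
tall_heavy  = rng.normal([185, 90], 3, size=(20, 2))
X = np.vstack([short_light, medium, tall_heavy])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.round(km.cluster_centers_))  # centroids land near the three group means
```

No labels were given: K-Means recovers the short/medium/tall grouping purely from the geometry of the points.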