K-Means Clustering#

K-Means is one of the most popular unsupervised learning algorithms used for clustering.

  • It partitions a dataset into K clusters (groups).

  • Each cluster has a centroid (the “center” of the cluster).

  • Data points are assigned to the cluster whose centroid is closest (based on distance, usually Euclidean).

It’s like grouping students into K study groups based on how close their marks are in math and science.


The K-Means Algorithm (Step-by-Step)#

  1. Choose K → the number of clusters you want.

  2. Initialize centroids → randomly select K points as initial centroids.

  3. Assign points to nearest centroid → each data point joins the cluster with the closest centroid (using Euclidean distance).

  4. Update centroids → recalculate the centroid of each cluster as the mean of its points.

  5. Repeat steps 3–4 until:

    • Centroids stop changing significantly (convergence), or

    • Max iterations reached.
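The five steps above can be sketched directly in NumPy. This is a minimal illustration of Lloyd's algorithm, not a production implementation; the function name and parameters are my own choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly select K data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iters):
        # Step 3: assign each point to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster goes empty
                new_centroids[j] = members.mean(axis=0)
        # Step 5: stop when centroids stop changing significantly (convergence)
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return labels, centroids
```

The empty-cluster guard matters in practice: with unlucky random initialization a centroid can end up with no points, and taking the mean of an empty set would produce NaNs.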


Mathematical Intuition#

  • The objective function minimized by K-Means is the Within-Cluster Sum of Squares (WCSS):

\[ J = \sum_{i=1}^{K} \sum_{x \in C_i} \|x - \mu_i\|^2 \]

Where:

  • \(K\) = number of clusters

  • \(C_i\) = set of points in cluster \(i\)

  • \(\mu_i\) = centroid of cluster \(i\)

👉 Goal: minimize the total squared distance of all points from their cluster centroids.
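The objective \(J\) translates into a few lines of NumPy. A small sketch (the function name is my own) that evaluates WCSS for a given assignment:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-Cluster Sum of Squares: J = sum_i sum_{x in C_i} ||x - mu_i||^2."""
    return sum(
        np.sum((X[labels == i] - mu) ** 2)  # squared distances to centroid mu_i
        for i, mu in enumerate(centroids)
    )
```

Each K-Means iteration (reassign points, then recompute means) can only decrease this quantity, which is why the algorithm converges.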


Choosing the Right K (Number of Clusters)#

  1. Elbow Method → plot WCSS vs. K, look for an “elbow” point.

  2. Silhouette Score → measures how similar a point is to its own cluster vs. others.

  3. Gap Statistic → compares within-cluster dispersion against what would be expected from randomly distributed reference data.
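The Elbow Method can be sketched with scikit-learn, assuming it is available (`KMeans.inertia_` is exactly the WCSS objective; the function name here is my own):

```python
# Elbow method sketch: compute WCSS for a range of candidate K values.
import numpy as np
from sklearn.cluster import KMeans

def elbow_curve(X, k_values):
    """Return the WCSS (inertia) for each candidate number of clusters."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in k_values
    ]
```

Plotting `k_values` against the returned inertias and looking for the bend gives the elbow; for the second criterion, `sklearn.metrics.silhouette_score(X, labels)` computes the Silhouette Score directly.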


Advantages#

✅ Simple and easy to implement.

✅ Works well with large datasets.

✅ Converges quickly.


Limitations#

❌ Must pre-define K.

❌ Sensitive to initialization (bad initial centroids → poor results).

❌ Struggles with non-spherical clusters or clusters of different densities.

❌ Sensitive to outliers.


Example Use Cases#

  • Customer segmentation in marketing.

  • Document or image clustering.

  • Anomaly detection.

  • Grouping genes with similar expression patterns in biology.


Quick Intuition Example: Imagine you have data on people’s height and weight. If you run K-Means with \(K=3\):

  • Cluster 1 might group short & light people.

  • Cluster 2 might group medium height & medium weight people.

  • Cluster 3 might group tall & heavy people.

The algorithm finds these groups automatically without labels.
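The height/weight example can be run end-to-end with scikit-learn. This is a hedged sketch on made-up synthetic data (the means and spreads below are invented purely for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic height (cm) and weight (kg) data for three loose groups.
rng = np.random.default_rng(42)
heights = np.concatenate([rng.normal(155, 3, 20),   # shorter people
                          rng.normal(170, 3, 20),   # medium height
                          rng.normal(185, 3, 20)])  # taller people
weights = np.concatenate([rng.normal(50, 3, 20),    # lighter
                          rng.normal(70, 3, 20),    # medium weight
                          rng.normal(90, 3, 20)])   # heavier
X = np.column_stack([heights, weights])

# Fit K-Means with K=3; no labels are provided, the groups emerge from the data.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.round(km.cluster_centers_, 1))  # three (height, weight) centroids
```

Note that height and weight happen to be on comparable numeric scales here; with features on very different scales, standardizing first (e.g. `sklearn.preprocessing.StandardScaler`) is the usual practice, since Euclidean distance otherwise lets the larger-scale feature dominate.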
