Hyperparameter tuning#

The main hyperparameter in K-Means is k (the number of clusters). Choosing the right value of k is critical because it directly determines the granularity and quality of the resulting clusters. There are also auxiliary parameters such as init, max_iter, and n_init.

1. Choosing k#

  • Elbow Method: Plot the cost function (Within-Cluster Sum of Squares, WCSS) against different values of k. The “elbow point” — where adding more clusters yields sharply diminishing reductions in WCSS — indicates a good trade-off between variance explained and complexity.

  • Silhouette Score: Measures cohesion (how close points are to others in their own cluster) versus separation (how far they are from the nearest other cluster). Scores range from −1 to 1; higher silhouette = better cluster quality.

  • Gap Statistic: Compares the WCSS of the actual clustering with the expected WCSS under a null reference distribution (e.g., uniformly random data); a large gap suggests genuine cluster structure at that k.
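As a minimal sketch of the first two approaches, assuming scikit-learn and synthetic blob data (substitute your own feature matrix X in practice), the following loop records WCSS and silhouette score for a range of candidate k values:

```python
# Sketch: compare WCSS (inertia) and silhouette score across candidate k.
# make_blobs generates synthetic data; replace X with your own features.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

wcss, silhouette = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = km.inertia_                       # within-cluster sum of squares
    silhouette[k] = silhouette_score(X, km.labels_)
    print(f"k={k}  WCSS={wcss[k]:.1f}  silhouette={silhouette[k]:.3f}")
```

Plotting wcss against k and looking for the bend gives the elbow; the k that maximizes the silhouette score is an alternative, often less ambiguous, choice.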

2. Initialization method (init)#

  • k-means++ (default): Spreads the initial centroids apart probabilistically, which speeds up convergence and reduces the chance of a poor final solution.

  • Random: Picks initial centroids uniformly at random, risking poor local minima; running several initializations (n_init > 1) mitigates this.

3. Number of runs (n_init)#

  • Runs the clustering multiple times with different centroid seeds, then keeps the run with the lowest WCSS. A higher n_init reduces sensitivity to bad initialization.

4. Maximum iterations (max_iter)#

  • Caps the number of iterations per run; the algorithm may stop earlier once assignments stabilize. Defaults (e.g., 300 in scikit-learn) are usually sufficient, but the value can be tuned to trade speed against stability.
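Putting the auxiliary parameters together, a hedged sketch of a typical scikit-learn configuration (the specific values here are illustrative, not prescriptive) looks like:

```python
# Sketch: configuring the auxiliary K-Means parameters discussed above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(
    n_clusters=3,       # k, chosen via elbow / silhouette / gap statistic
    init="k-means++",   # spread-out initial centroids (the default)
    n_init=10,          # keep the best of 10 independent runs
    max_iter=300,       # cap on iterations per run
    random_state=0,     # reproducibility
).fit(X)

print(km.n_iter_, km.inertia_)  # iterations actually used, final WCSS
```

Note that n_iter_ is usually far below max_iter on well-separated data, since each run stops as soon as the cluster assignments stop changing.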


Handling Overfitting and Underfitting in K-Means#

Though clustering is unsupervised (no labels), the concepts of underfitting/overfitting still apply in terms of cluster quality.

Underfitting (too simple clusters)#

  • Cause:

    • Choosing too few clusters (k too small).

    • Poor initialization.

  • Symptoms:

    • High WCSS (large distances within clusters).

    • Low silhouette score.

  • Fix:

    • Increase k.

    • Use k-means++ instead of random init.

    • Increase n_init.

Overfitting (too many clusters)#

  • Cause:

    • Choosing k too large.

  • Symptoms:

    • Very low WCSS but clusters don’t generalize (in the extreme, k equal to the number of points gives each point its own cluster and WCSS of zero).

    • Silhouette score declines once k exceeds the data’s natural number of clusters.

  • Fix:

    • Use elbow method or silhouette score to cap k.

    • Regularize by preferring simpler solutions (Occam’s razor): among values of k with similar scores, pick the smallest.
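The overfitting symptoms above can be demonstrated directly. In this sketch (synthetic three-blob data via scikit-learn; the constants are illustrative), WCSS keeps falling as k grows, but the silhouette score exposes the over-segmentation:

```python
# Sketch: WCSS always falls as k grows, but silhouette flags over-segmentation.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=7)

results = {}
for k in (3, 20):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

# Expect: WCSS at k=20 is far lower, yet silhouette is worse —
# the classic sign of fragmenting real clusters.
for k, (wcss, sil) in results.items():
    print(f"k={k:2d}  WCSS={wcss:.1f}  silhouette={sil:.3f}")
```

This is why WCSS alone cannot be used to choose k: it rewards ever-finer fragmentation, while silhouette penalizes splitting genuine clusters.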


In short#

  • Tune k carefully using elbow, silhouette, or gap statistic.

  • Use k-means++ and n_init > 10 for stability.

  • Balance k to avoid under/overfitting: too low = coarse clusters, too high = fragmented clusters.