Hyperparameter tuning#
The main hyperparameter in K-Means is k, the number of clusters. Choosing the right k is critical because it directly determines cluster quality. There are also auxiliary parameters such as init, max_iter, and n_init.
1. Choosing k#
- Elbow Method: Plot the cost function (Within-Cluster Sum of Squares, WCSS) against different values of k. The "elbow point" indicates a good trade-off between variance explained and complexity.
- Silhouette Score: Measures cohesion (how close points are within a cluster) versus separation (how far apart clusters are). A higher silhouette score means better cluster quality.
- Gap Statistic: Compares the WCSS of the actual clustering with the WCSS of clusterings of randomly generated reference data.
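The elbow and silhouette approaches can be sketched with scikit-learn; the blob dataset, the scanned k range, and all seeds below are illustrative assumptions:

```python
# Sketch: scan k, recording WCSS (sklearn's inertia_) and the silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with 4 well-separated clusters (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)

wcss, sil = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X)
    wcss[k] = km.inertia_                    # within-cluster sum of squares
    sil[k] = silhouette_score(X, km.labels_)

# WCSS always shrinks as k grows, so look for the "elbow" in wcss;
# the silhouette score, by contrast, typically peaks near the true k.
best_k = max(sil, key=sil.get)
print(best_k, {k: round(v, 3) for k, v in sil.items()})
```

Plotting `wcss` against k makes the elbow visible; the silhouette scan needs no plot, since its maximum can be read off directly.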
2. Initialization method (init)#
- k-means++ (default): Ensures the initial centroids are well spread out, improving convergence.
- random: Risks converging to a poor local minimum, though setting n_init > 1 mitigates this.
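A hypothetical side-by-side comparison (dataset and seeds are assumptions): run each init strategy once and compare the final WCSS via the fitted model's inertia_ attribute.

```python
# Sketch: compare random vs k-means++ initialization on the same data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, random_state=0)

# A single run each (n_init=1), so the initialization actually matters.
random_run = KMeans(n_clusters=5, init="random", n_init=1, random_state=1).fit(X)
pp_run = KMeans(n_clusters=5, init="k-means++", n_init=1, random_state=1).fit(X)

# Lower inertia_ (WCSS) means tighter clusters; k-means++ usually wins or
# ties, because its starting centroids are already spread across the data.
print(random_run.inertia_, pp_run.inertia_)
```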
3. Number of runs (n_init)#
Run the clustering multiple times with different centroid seeds, then keep the run with the lowest WCSS. A higher n_init reduces sensitivity to bad initialization.
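Because the best of the n_init runs is kept, raising n_init can only help. A minimal sketch (toy data and seed are assumptions; with the same random_state, scikit-learn draws the per-run seeds from one stream, so the single run is also the first of the ten):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=6, random_state=3)

# Same random_state: the n_init=1 fit replays the first of the ten seeds.
one = KMeans(n_clusters=6, init="random", n_init=1, random_state=7).fit(X)
ten = KMeans(n_clusters=6, init="random", n_init=10, random_state=7).fit(X)

# Keeping the best of ten seeds never does worse than the first seed alone.
print(one.inertia_, ten.inertia_)
```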
4. Maximum iterations (max_iter)#
Caps the number of iterations per run so the algorithm always terminates. The default (300 in scikit-learn) is usually sufficient, but it can be tuned to trade speed against stability.
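The effect of max_iter shows up in the fitted model's n_iter_ attribute; here is a sketch on assumed toy data, with everything else held fixed:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Identical seed and a single run, so only the iteration cap differs.
capped = KMeans(n_clusters=4, init="random", n_init=1, max_iter=1,
                random_state=5).fit(X)
full = KMeans(n_clusters=4, init="random", n_init=1, max_iter=300,
              random_state=5).fit(X)

# Each Lloyd iteration can only lower WCSS, so extra iterations never hurt
# the cost -- they just take longer.
print(capped.n_iter_, full.n_iter_, capped.inertia_, full.inertia_)
```

Note that stopping at max_iter=1 typically raises a ConvergenceWarning, which is exactly the symptom of a cap set too low.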
Handling Overfitting and Underfitting in K-Means#
Though clustering is unsupervised (no labels), the concepts of underfitting/overfitting still apply in terms of cluster quality.
Underfitting (too simple clusters)#
Cause:
- Choosing too few clusters (k too small).
- Poor initialization.
Symptoms:
- High WCSS (large distances within clusters).
- Low silhouette score.
Fix:
- Increase k.
- Use k-means++ instead of random initialization.
- Increase n_init.
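These symptoms are easy to reproduce. On data with four well-separated groups (an illustrative assumption), forcing k=2 leaves large within-cluster distances:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.7, random_state=1)

under = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # k too small
good = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Too few clusters: much higher WCSS, and typically a lower silhouette score.
print(under.inertia_, good.inertia_)
print(silhouette_score(X, under.labels_), silhouette_score(X, good.labels_))
```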
Overfitting (too many clusters)#
Cause:
- Choosing k too large.
Symptoms:
- Very low WCSS, but the clusters don't generalize (in the extreme, each point forms its own cluster).
- Silhouette score decreases beyond some point.
Fix:
- Use the elbow method or silhouette score to cap k.
- Regularize by preferring simpler solutions (Occam's razor).
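Conversely, driving k far past the natural number of groups sends WCSS toward zero while the clustering becomes meaningless (a sketch with assumed toy data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=120, centers=4, random_state=2)

good = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
over = KMeans(n_clusters=60, n_init=10, random_state=0).fit(X)  # k far too large

# WCSS collapses as k grows -- at k == n_samples it would hit exactly 0 --
# which is why raw WCSS alone can never be used to pick k.
print(good.inertia_, over.inertia_)
```

This is the sense in which the elbow or silhouette criterion acts as the regularizer: it penalizes the fragmented high-k solutions that WCSS alone would always prefer.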
In short
- Tune k carefully using the elbow method, silhouette score, or gap statistic.
- Use k-means++ and n_init > 10 for stability.
- Balance k to avoid under-/overfitting: too low gives coarse clusters, too high gives fragmented clusters.