Hyperparameter tuning#

The main hyperparameter in K-Means is k (the number of clusters). Choosing the right value of k is critical because it directly determines the granularity and quality of the resulting clusters. There are also auxiliary parameters such as init, max_iter, and n_init.

1. Choosing k#

  • Elbow Method: Plot the cost function (Within-Cluster Sum of Squares, WCSS) against different values of k. The “elbow point” — where adding more clusters yields sharply diminishing reductions in WCSS — indicates a good trade-off between variance explained and complexity.

  • Silhouette Score: Measures cohesion (how close points are to others in their own cluster) versus separation (how far they are from the nearest other cluster). Scores range from −1 to 1; higher silhouette = better cluster quality.

  • Gap Statistic: Compares the WCSS of the actual clustering with the expected WCSS under a null reference distribution (e.g., uniformly random data); a large gap suggests genuine cluster structure at that k.
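As a minimal sketch of the first two approaches, assuming scikit-learn and synthetic blob data (substitute your own feature matrix X in practice), the following loop records WCSS and silhouette score for a range of candidate k values:

```python
# Sketch: compare WCSS (inertia) and silhouette score across candidate k.
# make_blobs generates synthetic data; replace X with your own features.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

wcss, silhouette = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = km.inertia_                       # within-cluster sum of squares
    silhouette[k] = silhouette_score(X, km.labels_)
    print(f"k={k}  WCSS={wcss[k]:.1f}  silhouette={silhouette[k]:.3f}")
```

Plotting wcss against k and looking for the bend gives the elbow; the k that maximizes the silhouette score is an alternative, often less ambiguous, choice.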

2. Initialization method (init)#

  • k-means++ (default): Spreads the initial centroids apart probabilistically, which speeds up convergence and reduces the chance of a poor final solution.

  • Random: Picks initial centroids uniformly at random, risking poor local minima; running several initializations (n_init > 1) mitigates this.

3. Number of runs (n_init)#

  • Runs the clustering multiple times with different centroid seeds, then keeps the run with the lowest WCSS. A higher n_init reduces sensitivity to bad initialization.

4. Maximum iterations (max_iter)#

  • Caps the number of iterations per run; the algorithm may stop earlier once assignments stabilize. Defaults (e.g., 300 in scikit-learn) are usually sufficient, but the value can be tuned to trade speed against stability.
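Putting the auxiliary parameters together, a hedged sketch of a typical scikit-learn configuration (the specific values here are illustrative, not prescriptive) looks like:

```python
# Sketch: configuring the auxiliary K-Means parameters discussed above.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(
    n_clusters=3,       # k, chosen via elbow / silhouette / gap statistic
    init="k-means++",   # spread-out initial centroids (the default)
    n_init=10,          # keep the best of 10 independent runs
    max_iter=300,       # cap on iterations per run
    random_state=0,     # reproducibility
).fit(X)

print(km.n_iter_, km.inertia_)  # iterations actually used, final WCSS
```

Note that n_iter_ is usually far below max_iter on well-separated data, since each run stops as soon as the cluster assignments stop changing.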


Handling Overfitting and Underfitting in K-Means#

Though clustering is unsupervised (no labels), the concepts of underfitting/overfitting still apply in terms of cluster quality.

Underfitting (too simple clusters)#

  • Cause:

    • Choosing too few clusters (k too small).

    • Poor initialization.

  • Symptoms:

    • High WCSS (large distances within clusters).

    • Low silhouette score.

  • Fix:

    • Increase k.

    • Use k-means++ instead of random init.

    • Increase n_init.

Overfitting (too many clusters)#

  • Cause:

    • Choosing k too large.

  • Symptoms:

    • Very low WCSS but clusters don’t generalize (in the extreme, k equal to the number of points gives each point its own cluster and WCSS of zero).

    • Silhouette score declines once k exceeds the data’s natural number of clusters.

  • Fix:

    • Use elbow method or silhouette score to cap k.

    • Regularize by preferring simpler solutions (Occam’s razor): among values of k with similar scores, pick the smallest.
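The overfitting symptoms above can be demonstrated directly. In this sketch (synthetic three-blob data via scikit-learn; the constants are illustrative), WCSS keeps falling as k grows, but the silhouette score exposes the over-segmentation:

```python
# Sketch: WCSS always falls as k grows, but silhouette flags over-segmentation.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=3, random_state=7)

results = {}
for k in (3, 20):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

# Expect: WCSS at k=20 is far lower, yet silhouette is worse —
# the classic sign of fragmenting real clusters.
for k, (wcss, sil) in results.items():
    print(f"k={k:2d}  WCSS={wcss:.1f}  silhouette={sil:.3f}")
```

This is why WCSS alone cannot be used to choose k: it rewards ever-finer fragmentation, while silhouette penalizes splitting genuine clusters.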


In short#

  • Tune k carefully using elbow, silhouette, or gap statistic.

  • Use k-means++ and n_init > 10 for stability.

  • Balance k to avoid under/overfitting: too low = coarse clusters, too high = fragmented clusters.