Assumptions#

1. Clusters are Spherical (Convex & Isotropic)#

  • K-Means assumes that clusters are round/ball-shaped in the feature space.

  • Works well when clusters are circular (2D) or spherical (multi-dimensional).

  • If clusters are elongated, irregular, or non-linear (like moons or spirals), K-Means performs poorly.

👉 Example:

  • ✅ Works: Height vs Weight (clusters look circular).

  • ❌ Fails: Data shaped like two half-moons.
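The half-moon failure case above can be reproduced as a minimal sketch (assuming scikit-learn is available; the sample size and noise level are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: non-convex clusters.
X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

# K-Means cuts the moons with a straight boundary instead of
# following their curvature, so it mixes points from both moons.
labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Agreement with the true moon labels is low (1.0 would be perfect).
score = adjusted_rand_score(y, labels)
print(round(score, 3))
```

An algorithm that follows connectivity rather than centroids (e.g., DBSCAN or spectral clustering) recovers the moons correctly.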


2. Clusters are of Similar Size (Balanced Clusters)#

  • K-Means assumes that each cluster has a similar number of points.

  • If one cluster is very large and another is very small, K-Means tends to misclassify the smaller one.

👉 Example: In customer segmentation, if 95% are young customers and 5% are elderly, K-Means may “ignore” the smaller group.
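A sketch of the size imbalance (the 95/5 split, cluster locations, and spreads are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 950 points in one blob, 50 in another: a 95/5 imbalance.
big = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(950, 2))
small = rng.normal(loc=[4.0, 0.0], scale=1.0, size=(50, 2))
X = np.vstack([big, small])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Because K-Means places the boundary midway between centroids,
# the small cluster absorbs tail points from the big one, so the
# recovered sizes drift away from the true 950/50 split.
sizes = np.bincount(labels)
print(sorted(sizes))
```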


3. Equal Density#

  • The algorithm assumes all clusters have roughly the same spread (variance), and therefore similar density.

  • If one cluster is dense and another is sparse, the centroid updates can become biased.
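A sketch of the density mismatch (blob positions and spreads are illustrative): one tight blob next to one diffuse blob. Because K-Means weighs every point equally in the squared-error sum, the boundary cuts into the diffuse blob.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
dense = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2))   # tight
sparse = rng.normal(loc=[3.0, 0.0], scale=2.0, size=(200, 2))  # diffuse
X = np.vstack([dense, sparse])

labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# The cluster holding the dense blob also absorbs the near side of
# the sparse blob, so its size typically exceeds the true 200.
dense_label = np.bincount(labels[:200]).argmax()
print(np.bincount(labels)[dense_label])
```

A Gaussian Mixture Model, which fits a per-cluster covariance, handles this case more gracefully.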


4. Clusters are Linearly Separable#

  • K-Means implicitly assumes that clusters can be separated by straight lines (hyperplanes in higher dimensions): the decision boundary between any two centroids is the perpendicular bisector between them.

  • It struggles with overlapping clusters or clusters with complex boundaries.

👉 Example: It cannot separate spiral-shaped data.
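Why the boundary is always linear can be verified directly: a point goes to centroid c1 over c2 iff ‖x − c1‖² < ‖x − c2‖², and expanding the squared norms leaves a linear inequality in x. A small self-checking sketch (the centroid values are arbitrary):

```python
import numpy as np

c1, c2 = np.array([0.0, 0.0]), np.array([4.0, 2.0])

def assign(x):
    # Nearest-centroid rule used by K-Means.
    return 0 if np.sum((x - c1) ** 2) < np.sum((x - c2) ** 2) else 1

def linear_rule(x):
    # Expanding the squared norms gives the equivalent linear test:
    # 2 (c2 - c1) . x  <  ||c2||^2 - ||c1||^2
    return 0 if 2 * (c2 - c1) @ x < c2 @ c2 - c1 @ c1 else 1

# Both rules agree everywhere: the boundary is a straight line
# (the perpendicular bisector of the segment c1-c2).
rng = np.random.default_rng(0)
pts = rng.uniform(-5.0, 8.0, size=(1000, 2))
assert all(assign(p) == linear_rule(p) for p in pts)
```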


5. Euclidean Distance is Meaningful#

  • K-Means minimizes squared Euclidean distances, so it assumes that:

    • Features are on the same scale (standardization is important).

    • The geometry of the feature space is meaningful.

  • If features are on very different scales (e.g., income in lakhs vs. age in years), results will be distorted.
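A sketch of the scaling fix (the income and age distributions are illustrative): without standardization, squared-Euclidean distance is dominated by the feature with the largest numeric range.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
income = rng.normal(500_000, 100_000, size=200)  # huge numeric range
age = rng.normal(40, 12, size=200)               # tiny numeric range
X = np.column_stack([income, age])

# StandardScaler gives each column mean ~0 and std ~1, so neither
# feature dominates the squared-Euclidean objective.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```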


6. Number of Clusters (K) is Known#

  • You must predefine K.

  • Wrong choice of K leads to poor clustering.

  • Techniques like the Elbow Method or the Silhouette Score help estimate K.
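A sketch of estimating K with the silhouette score (the blob centers and spread are illustrative; three well-separated blobs, so K = 3 should score highest):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three clearly separated blobs.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.6, random_state=42)

# Fit K-Means for several candidate K and score each clustering.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The silhouette score peaks at the best-supported K.
best_k = max(scores, key=scores.get)
print(best_k)  # 3 for these well-separated blobs
```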


7. No Major Outliers#

  • Outliers can drag centroids away from the true cluster center.

  • K-Means assumes the dataset does not contain many extreme outliers.
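A sketch of the outlier effect (the blob and the outlier position are illustrative): a single extreme point can claim an entire centroid for itself.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
blob = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
X = np.vstack([blob, [[50.0, 50.0]]])  # one extreme outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# One centroid is spent on the lone outlier, leaving only one
# centroid for the 100 "real" points.
sizes = np.bincount(km.labels_)
print(sorted(sizes))
```

Removing outliers beforehand, or using a robust variant such as K-Medoids, mitigates this.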


Summary Table of Assumptions

| Assumption | Meaning | Limitation if Violated |
| --- | --- | --- |
| Spherical Clusters | Clusters are round/convex | Poor results with irregular shapes |
| Similar Size | Each cluster has a similar number of points | Small clusters ignored |
| Equal Density | Same spread/variance across clusters | Sparse clusters misclassified |
| Linear Separability | Can be divided by straight lines | Complex shapes not captured |
| Euclidean Distance | Distance is meaningful, features scaled | Different scales distort results |
| Known K | Number of clusters must be predefined | Wrong K → poor clustering |
| No Outliers | Data is relatively clean | Outliers distort centroids |


In short: K-Means works best when clusters are spherical, equal-sized, equally dense, well-separated, and features are scaled properly.