Assumptions#
1. Clusters are Spherical (Convex & Isotropic)#
K-Means assumes that clusters are round/ball-shaped in the feature space.
Works well when clusters are circular (2D) or spherical (multi-dimensional).
If clusters are elongated, irregular, or non-convex (like moons or spirals), K-Means performs poorly.
👉 Example:
✅ Works: Height vs Weight (clusters look circular).
❌ Fails: Data shaped like two half-moons.
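This contrast can be sketched with scikit-learn's toy datasets (the dataset parameters below are illustrative assumptions, not fixed rules):

```python
# Sketch: K-Means on round blobs vs. two half-moons.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Round, well-separated blobs: K-Means recovers them almost perfectly.
X_blob, y_blob = make_blobs(n_samples=300, centers=2, cluster_std=0.6, random_state=42)
blob_pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_blob)

# Two interleaved half-moons: the spherical assumption breaks down.
X_moon, y_moon = make_moons(n_samples=300, noise=0.05, random_state=42)
moon_pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moon)

print(adjusted_rand_score(y_blob, blob_pred))  # close to 1.0
print(adjusted_rand_score(y_moon, moon_pred))  # much lower: moons get cut in half
```

Adjusted Rand Index compares learned labels against the known generating clusters, so a low score on the moons confirms the shape assumption was violated.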
2. Clusters are of Similar Size (Balanced Clusters)#
K-Means assumes that each cluster has a similar number of points.
If one cluster is very large and another is very small, K-Means tends to misclassify the smaller one.
👉 Example: In customer segmentation, if 95% are young customers and 5% are elderly, K-Means may “ignore” the smaller group.
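A small synthetic sketch of this effect (the sizes and spreads below are assumed for illustration): because K-Means minimizes total squared distance, splitting a large cluster in half often costs less than giving a tiny group its own centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# A 2000-point majority cluster and a 20-point minority cluster nearby.
big = rng.normal(loc=[0.0, 0.0], scale=2.0, size=(2000, 2))
small = rng.normal(loc=[5.0, 0.0], scale=0.2, size=(20, 2))
X = np.vstack([big, small])

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
sizes = np.bincount(labels)

# Minimizing total squared distance favors splitting the big cluster
# roughly in half over isolating the 20 minority points, so the small
# group is absorbed into one half rather than recovered.
print(sorted(sizes))
```

Neither learned cluster ends up matching the 20-point minority group; it simply merges into the nearer half of the majority.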
3. Equal Density#
The algorithm assumes all clusters have roughly the same spread (variance), i.e., similar point density.
If one cluster is dense and another is sparse, the centroid updates can become biased.
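A sketch of this bias, using one tight and one wide synthetic cluster (parameters assumed for illustration): points in the sparse cluster's inner tail fall on the dense cluster's side of the midpoint boundary and get "stolen".

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
dense = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2))   # tight cluster
sparse = rng.normal(loc=[5.0, 0.0], scale=2.5, size=(200, 2))  # wide cluster
X = np.vstack([dense, sparse])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Find which learned label corresponds to the dense cluster
# (majority label among the first 200 points).
dense_label = np.bincount(labels[:200]).argmax()

# Sparse-cluster points lying closer to the dense centroid than to
# their own get misassigned to the dense cluster.
stolen = int((labels[200:] == dense_label).sum())
print(stolen)
```

K-Means uses only centroid distance, not local density, so the wide cluster systematically loses its inner tail to the tight one.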
4. Clusters are Linearly Separable#
K-Means implicitly assumes that clusters can be separated by straight lines: each point is assigned to its nearest centroid, so the boundary between any two clusters is the perpendicular bisector of their centroids (a Voronoi partition).
It struggles with overlapping clusters or clusters with complex boundaries.
👉 Example: It cannot separate spiral-shaped data.
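The nearest-centroid rule above can be verified directly (the dataset here is an arbitrary illustrative choice): `predict` agrees exactly with a hand-computed argmin over centroid distances, which is why the boundaries are straight bisectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Assignment rule: each point goes to its nearest centroid, so the
# boundary between two clusters is the perpendicular bisector of
# their centroids -- a straight line.
pts = np.random.RandomState(1).uniform(X.min(), X.max(), size=(50, 2))
dists = np.linalg.norm(pts[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

print((nearest == km.predict(pts)).all())  # True: predict == nearest centroid
```

Because every boundary is linear, no arrangement of centroids can carve out a spiral- or ring-shaped region.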
5. Euclidean Distance is Meaningful#
K-Means minimizes squared Euclidean distances, so it assumes that:
Features are on the same scale (standardization is important).
The geometry of the feature space is meaningful.
If features are on very different scales (e.g., income in lakhs vs. age in years), results will be distorted.
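A minimal sketch of the distortion (the customer values below are hypothetical): on raw data, the income axis dominates the Euclidean distance, so a large age difference barely registers until the features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [annual income in rupees, age in years]
X = np.array([[500000.0, 25],
              [520000.0, 60],   # similar income, very different age
              [900000.0, 26]])  # very different income, similar age

# Unscaled: income (magnitude ~1e5) swamps age (magnitude ~1e1).
d_age = np.linalg.norm(X[0] - X[1])     # pair differing mainly in age
d_income = np.linalg.norm(X[0] - X[2])  # pair differing mainly in income
print(d_age < d_income)  # True: the age gap looks tiny on the raw scale

# After standardization, each feature contributes comparably.
Xs = StandardScaler().fit_transform(X)
```

Standardizing (zero mean, unit variance per feature) before clustering is the usual remedy.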
6. Number of Clusters (K) is Known#
You must predefine K.
Wrong choice of K leads to poor clustering.
Techniques like the Elbow Method or the Silhouette Score help estimate K.
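A quick sketch of the silhouette approach on synthetic data with three well-separated blobs (dataset parameters are assumed for illustration): fit K-Means for several candidate values of K and keep the one with the highest score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1], higher is better

best_k = max(scores, key=scores.get)
print(best_k)  # expect 3 for these well-separated blobs
```

On messy real data the peak is rarely this clean, so the score is a guide rather than a guarantee.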
7. No Major Outliers#
Outliers can drag centroids away from the true cluster center.
K-Means assumes the dataset does not contain many extreme outliers.
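Since each centroid is the mean of its assigned points, a single extreme value can shift it noticeably. A minimal sketch with one cluster and one assumed outlier:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.normal(loc=0.0, scale=0.5, size=(100, 2))  # tight cluster at the origin
X_outlier = np.vstack([X, [[50.0, 50.0]]])         # plus one extreme outlier

center_clean = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X).cluster_centers_[0]
center_dirty = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X_outlier).cluster_centers_[0]

# With k=1 the centroid is just the mean, so a single point at (50, 50)
# drags it roughly half a unit away from the true center at the origin.
print(np.linalg.norm(center_clean), np.linalg.norm(center_dirty))
```

Removing outliers beforehand, or using a median-based variant such as K-Medoids, mitigates this.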
Summary Table of Assumptions#

| Assumption | Meaning | Limitation if Violated |
|---|---|---|
| Spherical Clusters | Clusters are round/convex | Poor results with irregular shapes |
| Similar Size | Each cluster has a similar number of points | Small clusters ignored |
| Equal Density | Same spread/variance across clusters | Sparse clusters misclassified |
| Linear Separability | Can be divided by straight lines | Complex shapes not captured |
| Euclidean Distance | Distance is meaningful, features scaled | Different scales distort results |
| Known K | Number of clusters must be predefined | Wrong K → poor clustering |
| No Outliers | Data is relatively clean | Outliers distort centroids |
In short: K-Means works best when clusters are spherical, equal-sized, equally dense, well-separated, and features are scaled properly.