Assumptions#
1. Clusters are Spherical (Convex & Isotropic)#
K-Means assumes that clusters are round/ball-shaped in the feature space.
Works well when clusters are circular (2D) or spherical (multi-dimensional).
If clusters are elongated, irregular, or non-convex (like moons or spirals), K-Means performs poorly.
👉 Example:
✅ Works: Height vs Weight (clusters look circular).
❌ Fails: Data shaped like two half-moons.
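This contrast can be sketched with scikit-learn's toy datasets (the dataset parameters below are illustrative assumptions, not fixed rules):

```python
# Sketch: K-Means on round blobs vs. two half-moons.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_moons
from sklearn.metrics import adjusted_rand_score

# Round, well-separated blobs: K-Means recovers them almost perfectly.
X_blob, y_blob = make_blobs(n_samples=300, centers=2, cluster_std=0.6, random_state=42)
blob_pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_blob)

# Two interleaved half-moons: the spherical assumption breaks down.
X_moon, y_moon = make_moons(n_samples=300, noise=0.05, random_state=42)
moon_pred = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moon)

print(adjusted_rand_score(y_blob, blob_pred))  # close to 1.0
print(adjusted_rand_score(y_moon, moon_pred))  # much lower: moons get cut in half
```

Adjusted Rand Index compares learned labels against the known generating clusters, so a low score on the moons confirms the shape assumption was violated.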
2. Clusters are of Similar Size (Balanced Clusters)#
K-Means assumes that each cluster has a similar number of points.
If one cluster is very large and another is very small, K-Means tends to misclassify the smaller one.
👉 Example: In customer segmentation, if 95% are young customers and 5% are elderly, K-Means may “ignore” the smaller group.
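A small synthetic sketch of this effect (the sizes and spreads below are assumed for illustration): because K-Means minimizes total squared distance, splitting a large cluster in half often costs less than giving a tiny group its own centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(42)
# A 2000-point majority cluster and a 20-point minority cluster nearby.
big = rng.normal(loc=[0.0, 0.0], scale=2.0, size=(2000, 2))
small = rng.normal(loc=[5.0, 0.0], scale=0.2, size=(20, 2))
X = np.vstack([big, small])

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
sizes = np.bincount(labels)

# Minimizing total squared distance favors splitting the big cluster
# roughly in half over isolating the 20 minority points, so the small
# group is absorbed into one half rather than recovered.
print(sorted(sizes))
```

Neither learned cluster ends up matching the 20-point minority group; it simply merges into the nearer half of the majority.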
3. Equal Density#
The algorithm assumes all clusters have roughly the same spread (variance), i.e., similar point density.
If one cluster is dense and another is sparse, the centroid updates can become biased.
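A sketch of this bias, using one tight and one wide synthetic cluster (parameters assumed for illustration): points in the sparse cluster's inner tail fall on the dense cluster's side of the midpoint boundary and get "stolen".

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
dense = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2))   # tight cluster
sparse = rng.normal(loc=[5.0, 0.0], scale=2.5, size=(200, 2))  # wide cluster
X = np.vstack([dense, sparse])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Find which learned label corresponds to the dense cluster
# (majority label among the first 200 points).
dense_label = np.bincount(labels[:200]).argmax()

# Sparse-cluster points lying closer to the dense centroid than to
# their own get misassigned to the dense cluster.
stolen = int((labels[200:] == dense_label).sum())
print(stolen)
```

K-Means uses only centroid distance, not local density, so the wide cluster systematically loses its inner tail to the tight one.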
4. Clusters are Linearly Separable#
K-Means implicitly assumes that clusters can be separated by straight lines: each point is assigned to its nearest centroid, so the boundary between any two clusters is the perpendicular bisector of their centroids (a Voronoi partition).
It struggles with overlapping clusters or clusters with complex boundaries.
👉 Example: It cannot separate spiral-shaped data.
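The nearest-centroid rule above can be verified directly (the dataset here is an arbitrary illustrative choice): `predict` agrees exactly with a hand-computed argmin over centroid distances, which is why the boundaries are straight bisectors.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Assignment rule: each point goes to its nearest centroid, so the
# boundary between two clusters is the perpendicular bisector of
# their centroids -- a straight line.
pts = np.random.RandomState(1).uniform(X.min(), X.max(), size=(50, 2))
dists = np.linalg.norm(pts[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

print((nearest == km.predict(pts)).all())  # True: predict == nearest centroid
```

Because every boundary is linear, no arrangement of centroids can carve out a spiral- or ring-shaped region.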
5. Euclidean Distance is Meaningful#
K-Means minimizes squared Euclidean distances, so it assumes that:
Features are on the same scale (standardization is important).
The geometry of the feature space is meaningful.
If features are on very different scales (e.g., income in lakhs vs. age in years), results will be distorted.
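A minimal sketch of the distortion (the customer values below are hypothetical): on raw data, the income axis dominates the Euclidean distance, so a large age difference barely registers until the features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: [annual income in rupees, age in years]
X = np.array([[500000.0, 25],
              [520000.0, 60],   # similar income, very different age
              [900000.0, 26]])  # very different income, similar age

# Unscaled: income (magnitude ~1e5) swamps age (magnitude ~1e1).
d_age = np.linalg.norm(X[0] - X[1])     # pair differing mainly in age
d_income = np.linalg.norm(X[0] - X[2])  # pair differing mainly in income
print(d_age < d_income)  # True: the age gap looks tiny on the raw scale

# After standardization, each feature contributes comparably.
Xs = StandardScaler().fit_transform(X)
```

Standardizing (zero mean, unit variance per feature) before clustering is the usual remedy.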
6. Number of Clusters (K) is Known#
You must predefine K.
Wrong choice of K leads to poor clustering.
Techniques like the Elbow Method or the Silhouette Score help estimate K.
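A quick sketch of the silhouette approach on synthetic data with three well-separated blobs (dataset parameters are assumed for illustration): fit K-Means for several candidate values of K and keep the one with the highest score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # in [-1, 1], higher is better

best_k = max(scores, key=scores.get)
print(best_k)  # expect 3 for these well-separated blobs
```

On messy real data the peak is rarely this clean, so the score is a guide rather than a guarantee.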
7. No Major Outliers#
Outliers can drag centroids away from the true cluster center.
K-Means assumes the dataset does not contain many extreme outliers.
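Since each centroid is the mean of its assigned points, a single extreme value can shift it noticeably. A minimal sketch with one cluster and one assumed outlier:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.normal(loc=0.0, scale=0.5, size=(100, 2))  # tight cluster at the origin
X_outlier = np.vstack([X, [[50.0, 50.0]]])         # plus one extreme outlier

center_clean = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X).cluster_centers_[0]
center_dirty = KMeans(n_clusters=1, n_init=10, random_state=0).fit(X_outlier).cluster_centers_[0]

# With k=1 the centroid is just the mean, so a single point at (50, 50)
# drags it roughly half a unit away from the true center at the origin.
print(np.linalg.norm(center_clean), np.linalg.norm(center_dirty))
```

Removing outliers beforehand, or using a median-based variant such as K-Medoids, mitigates this.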
Summary Table of Assumptions#

| Assumption | Meaning | Limitation if Violated |
|---|---|---|
| Spherical Clusters | Clusters are round/convex | Poor results with irregular shapes |
| Similar Size | Each cluster has a similar number of points | Small clusters ignored |
| Equal Density | Same spread/variance across clusters | Sparse clusters misclassified |
| Linear Separability | Can be divided by straight lines | Complex shapes not captured |
| Euclidean Distance | Distance is meaningful, features scaled | Different scales distort results |
| Known K | Number of clusters must be predefined | Wrong K → poor clustering |
| No Outliers | Data is relatively clean | Outliers distort centroids |
In short: K-Means works best when clusters are spherical, equal-sized, equally dense, well-separated, and features are scaled properly.