1- Linear to Quadratic Complexity
K-Means has O(N*P*K) complexity per iteration, where N is the number of observations (rows), P is the number of features (columns) and K is the number of centroids. This means that when the data is low-dimensional, K-Means behaves close to linearly, and as dimensionality grows, its time complexity can theoretically approach quadratic.
For a full K-Means run, the per-iteration complexity above is multiplied by the number of iterations, after which the complexity can be expressed as O(N*P*K*i), where i is the number of iterations.
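To make the formula concrete, here is a minimal sketch that estimates the number of distance computations from N, P, K and i (the function name and the example values are illustrative, not from a benchmark):

```python
# Rough operation count for K-Means: each of the N points is compared
# against K centroids across P dimensions, repeated for i iterations.
def kmeans_op_estimate(n_samples, n_features, n_clusters, n_iter):
    """Estimate distance computations as N * P * K * i."""
    return n_samples * n_features * n_clusters * n_iter

# Low-dimensional data stays close to linear in N:
print(kmeans_op_estimate(100_000, 2, 8, 300))   # 480,000,000
# Higher dimensionality multiplies the cost of every iteration:
print(kmeans_op_estimate(100_000, 56, 8, 300))  # 13,440,000,000
```

Note how going from 2 to 56 features scales the estimate by the same factor of 28, which matches the intuition that dimensionality multiplies the per-iteration cost.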
K-Means Runtime Performance
2- So is K-Means fast?
In short, yes. K-Means is an algorithm with fast runtime performance. Since there is no separate training phase, we are only concerned with inference-phase performance and complexity.
Runtime Speed Performances:
56 features, max_iter=300, init="k-means++"
K-Means (50K): 3.14 seconds
K-Means (100K): 4.66 seconds
K-Means (250K): 13.04 seconds
K-Means (500K): 26.48 seconds
K-Means (1M): 27.23 seconds
2 features, max_iter=300, init="k-means++"
K-Means (50K): 1.89 seconds
K-Means (100K): 3.28 seconds
K-Means (250K): 7.27 seconds
K-Means (500K): 16.30 seconds
K-Means (1M): 19.55 seconds
Please note: tests were done on a fairly fast household computer (8th Gen Intel i7 processor, 16GB RAM). Various factors can affect these results, especially at the small end of the scale, where runtimes are short and measurement noise matters more.
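A sketch of a timing setup like the one above, using synthetic data (the cluster count of 8 is an assumption, as the original runs don't state it; exact timings will vary by hardware):

```python
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

for n_samples in (50_000, 100_000):
    # 2 features, as in the second set of runs above
    X = rng.random((n_samples, 2))
    # n_clusters=8 is an illustrative choice, not from the original tests
    model = KMeans(n_clusters=8, init="k-means++", max_iter=300, n_init=10)
    start = time.perf_counter()
    model.fit(X)
    print(f"K-Means ({n_samples}): {time.perf_counter() - start:.2f} seconds")
```

Scaling `n_samples` up through 250K, 500K and 1M reproduces the shape of the tables above, though the absolute numbers depend on the machine.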
3- Inference Only
K-Means doesn’t have a separate training phase and can be used for inference (clustering) straight away. In Scikit-Learn this all happens when we fit a KMeans model: fitting produces the clusters and centroids.
This is an additional bonus on top of the K-Means algorithm’s speed. K-Means can still be on the slower side when the data is very large with many dimensions; its calculations and complexity are very similar to those of the kNN algorithm. Unlike kNN, though, K-Means runtime performance is usually well tolerated in projects, since clustering is performed on the whole dataset and relatively infrequently.
Since the K-Means algorithm has no training phase, it also has no training-phase time complexity.
When the model is fit, K-Means is introduced to the data, and centroid initialization, distance calculations and the K-Means iterations are all carried out.
Basically, the whole process takes place during fitting, and the clusters and centroids are created there.
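A minimal example of this fit-only workflow (the toy data points are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points, purely for illustration
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]])

# A single fit() call runs initialization, distance computation
# and the K-Means iterations; there is no separate predict step needed.
model = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(X)

print(model.labels_)           # cluster assignment for every sample
print(model.cluster_centers_)  # one centroid per cluster
```

After fitting, `labels_` holds the cluster of each row and `cluster_centers_` holds the final centroids, which is exactly the "clusters and centroids" output described above.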
The K-Means algorithm’s inference (clustering) complexity can vary between linear and quadratic.
In big O notation it can be shown as O(N*P*K) per iteration, with N the sample size, P the number of features and K the number of clusters.
A fundamental model for clustering
In summary, K-Means complexity depends on dimensionality, but in most cases it scales well.
Its runtime performance is usually fast and well justified for the job done.
K-Means can be improved slightly by tuning its parameters, but those parameters are vital to the clustering outcome, and it can be risky to tune K-Means primarily for speed. (It’s more common to tune K-Means for clustering quality.) For more details, see: