K-Means is an unsupervised learning algorithm used for clustering. That means it works on data with no labels and doesn't require a supervised training process. Clustering algorithms such as K-Means can be used to create clusters and extract meaning from unstructured data.
When you combine these two traits, K-Means becomes a fantastic method for gaining insight where other machine learning algorithms can't. From that perspective, K-Means doesn't compete with the popular supervised machine learning algorithms (such as k-NN, linear models, SVM, decision trees, random forests, etc.) and navigates in its own lane.
The fact that K-Means is also very easy to implement, understand, and tweak if necessary makes it a very popular and useful unsupervised machine learning algorithm. That being said, all machine learning algorithms have their own sort of fame and coolness to them.
In practice, the K-Means algorithm is very fast (one of the fastest clustering algorithms available), but it tends to fall into local minima. That's why it can be useful to run it several times, which is feasible since it's fast.
The main advantage of K-Means is the opportunity to gain knowledge from datasets without any labels. Normally, a human might have difficulty making any meaning of many columns of random-looking numbers.
But when analyzed with clustering techniques, such datasets become more valuable and meaningful. Additionally, K-Means is quite fast compared to other algorithms, and it's also very easy to use and interpret.
Also, K-Means doesn't require the data to follow any linear relationship; keep in mind, though, that it works best on roughly spherical, well-separated clusters, so highly irregular cluster shapes can still be a challenge for it.
Some of the common applications of K-Means are:
K-Means emerged during the 1950s and 1960s in the works of multiple independent researchers from multiple domains. However, James MacQueen of the University of California was the first to use the term "k-means", in his 1967 research paper.
You can find the original paper in this article:
K-Means has O(N*P*K) complexity per iteration, where N is the number of observations (rows), P is the number of columns (features), and K is the number of centroids. This means K-Means is linear in the dataset size for a fixed K, but if K grows with the dataset, the cost can approach quadratic complexity.
For a full K-Means run, the per-iteration cost above is multiplied by the number of iterations, after which the complexity can be expressed as O(N*P*K*i), where i is the iteration count.
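As a rough illustration of the formula above, here is a minimal sketch (the dataset sizes are hypothetical) that estimates the number of distance-feature operations a run would perform:

```python
def kmeans_cost(n_rows, n_cols, n_clusters, n_iters):
    """Rough operation count for K-Means: O(N * P * K * i).

    Each iteration computes a distance between every row and every
    centroid, and each distance calculation touches all P features.
    """
    return n_rows * n_cols * n_clusters * n_iters

# Hypothetical dataset: 50,000 rows, 56 columns, 8 clusters, 300 iterations
print(kmeans_cost(50_000, 56, 8, 300))  # 6,720,000,000 operations
```

Doubling the rows doubles the estimate, which is why K-Means stays practical on large datasets as long as K is kept modest.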
Runtime speed performance:
56 columns, max_iter=300, init="k-means++"
K-Means (50K rows): 3.14 seconds
K-Means (500K rows): 26.48 seconds
K-Means (1M rows): 27.23 seconds
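A minimal sketch of how such timings can be reproduced (the synthetic dataset below is much smaller than the ones above, and the blob sizes are illustrative):

```python
import time

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data: 5,000 rows, 56 columns (sizes are illustrative)
rng = np.random.default_rng(0)
X = rng.standard_normal((5_000, 56))

model = KMeans(n_clusters=8, init="k-means++", max_iter=300, n_init=10)

start = time.perf_counter()
model.fit(X)
elapsed = time.perf_counter() - start

print(f"K-Means (5K rows): {elapsed:.2f} seconds")
```

Exact numbers will vary with hardware, data, and Scikit-Learn version, so treat any published timings as indicative rather than absolute.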
You can see a more comprehensive analysis of K-Means Complexity and Runtime Performances in this article:
Using Scikit-Learn's cluster module, you can create K-Means clustering models very easily. K-Means clustering is a very intuitive and straightforward process, and it offers great insight into unlabeled, unstructured datasets.
Even if the data is structured, K-Means can be used to complement the findings of supervised machine learning algorithms and create hybrid projects in terms of machine learning technique.
You can check out this tutorial to see how you can simply create and use a K-Means model using the Scikit-Learn library and Python:
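As a minimal sketch of the idea, here is how a K-Means model can be created with Scikit-Learn (the two-group toy data below is made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Tiny toy dataset with two obvious groups (values are illustrative)
X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(labels)                  # one cluster id per row
print(model.cluster_centers_)  # learned centroid coordinates
```

Note that no labels were provided anywhere: the cluster ids come entirely from the structure of the data itself.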
In some situations, it can be very helpful to create a more customized K-Means model by adjusting and tuning the parameters of the KMeans class in Scikit-Learn. These techniques can help you create a clustering model that caters better to the needs of your project. For tuning K-Means models and K-Means optimization, please refer to the next section.
The Scikit-Learn implementation of the K-Means model comes with pretty sensible defaults. However, you can still tune a few parameters and adapt the K-Means algorithm to your liking and to your project.
Another benefit of tuning K-Means is that it really helps you understand the algorithm and how it is constructed. For example, the init parameter can be used to define the centroid initialization algorithm. This makes you really think about what centroids are and how they work. By default, init is set to "k-means++", a popular centroid initialization algorithm that tries to ensure well-spread initial positions for each cluster center.
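A minimal sketch comparing the two built-in init strategies (the three synthetic blobs below are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs (illustrative data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

# "k-means++" spreads initial centroids out; "random" picks random rows
smart = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
naive = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(X)

# inertia_ is the within-cluster sum of squared distances (lower is tighter)
print(smart.inertia_, smart.n_iter_)
print(naive.inertia_, naive.n_iter_)
```

On easy data like this both strategies usually reach the same solution; the difference shows up on harder datasets, where "k-means++" tends to converge in fewer iterations and avoid poor local minima more often.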
Some of the most commonly adjusted K-Means parameters and hyperparameters are:
You can read more about K-Means Optimization in the article below:
We prepared a K-Means implementation example where you can see how K-Means can be used to create clusters from unlabeled data. You can also find useful K-Means visualizations and some K-Means optimization techniques in the same example. Please see the page below:
K-Means, Hierarchical Clustering, and DBSCAN all serve different purposes in data clustering. These clustering algorithms complement each other and are suitable for different cases.
For example, K-Means can be used to create spherical, well-separated clusters, while DBSCAN can find clusters with arbitrary, even intertwined shapes.
Hierarchical clustering works in a way that's similar to decision tree structures, and it can create multi-level clusters in terms of depth. You can then pick whichever cluster depth level you'd like. Also, hierarchical clustering doesn't have initialization parameters such as a centroid count or density threshold, which can make it easier to use.
Additionally, DBSCAN can be useful for identifying outliers, since it works based on density and, unlike K-Means, it doesn't force every single point into a cluster.
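This difference can be sketched with a tiny made-up dataset: a tight group of points plus one far-away outlier. K-Means assigns every point a cluster, while DBSCAN marks the outlier as noise:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# A dense group of points plus one distant outlier (illustrative data)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [10.0, 10.0]])

# K-Means forces every point, including the outlier, into a cluster
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN labels points in low-density regions as noise (-1)
db_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)

print(km_labels)  # every point gets a cluster id
print(db_labels)  # the outlier is marked -1
```

The -1 label in DBSCAN's output is exactly what makes it handy for outlier detection, whereas K-Means would happily dedicate a cluster to the stray point.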