DBSCAN Computational Complexity
1- Quadratic Complexity
DBSCAN has quadratic time complexity in the worst case, which can be shown as O(n^2). In the best case (when an indexing structure makes neighborhood queries cheap), the algorithm runs in Ω(n*logn) time.
Based on this, we can see that DBSCAN won't scale well and may struggle with large datasets. However, there are a number of ways this can be addressed, and both DBSCAN's complexity and its runtime performance can be improved.
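To make the quadratic cost concrete, here is a minimal sketch (not Scikit-Learn's actual implementation) of the brute-force eps-neighborhood query that DBSCAN performs for every point. Running n such queries at O(n) work each is where the O(n^2) total comes from.

```python
import numpy as np

def region_query(X, i, eps):
    """Brute-force eps-neighborhood of point i: one pass over all n points."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.flatnonzero(dists <= eps)

# DBSCAN runs a query like this once per point, so without a spatial
# index the total work is n queries * O(n) each = O(n^2) distance checks.
rng = np.random.default_rng(0)
X = rng.random((1000, 2))
neighbors_of_0 = region_query(X, 0, eps=0.05)
```

Spatial index structures (kd-trees, ball trees) are what bring the per-query cost down and enable the better best-case behavior mentioned above.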
a) Improving Complexity
There have been exciting developments regarding improving the time complexity of the DBSCAN clustering algorithm.
In a paper by Junhao Gan and Yufei Tao of the Chinese University of Hong Kong, the researchers were able to significantly improve the time complexity of the DBSCAN algorithm at the cost of a tolerable approximation error. You can see the paper here:
The authors also reasonably argue that this new DBSCAN should become the norm for big data applications, since the original DBSCAN takes too long to be practical at that scale.
b) n_jobs parameter
n_jobs is a very useful parameter for parallelizing the DBSCAN algorithm. In the Scikit-Learn implementation it defaults to None, which effectively means the model uses only one processor core.
This is far from ideal: for a sluggish algorithm like DBSCAN, it pays to utilize all the computation power available, and parallelizing is a great way to optimize its performance. You can assign n_jobs=-1 to utilize all the processor cores at the same time.
You can see the performance gains on a 12-core i7 computer below (approximately a 6x speedup).
Runtime Performance of DBSCAN
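A minimal sketch of enabling parallelism; synthetic data from make_blobs stands in here for a real dataset:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 5000 points in 5 Gaussian blobs.
X, _ = make_blobs(n_samples=5000, centers=5, random_state=42)

# n_jobs=-1 parallelizes the neighborhood queries across all CPU cores.
db = DBSCAN(eps=0.5, min_samples=5, n_jobs=-1).fit(X)

labels = db.labels_  # one cluster label per sample; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Note that n_jobs parallelizes the neighborhood-search stage, so the speedup is substantial but not perfectly linear in the number of cores.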
2- DBSCAN Speed Tests
DBSCAN resembles the supervised kNN and the unsupervised K-Means algorithms: all of them rely heavily on distance calculations, and they seemingly share a common drawback, which is scaling issues.
For all of these machine learning algorithms, scaling issues and runtime performance can thankfully be improved, to a degree, by tuning the hyperparameters, and DBSCAN is no different. You can see our article below regarding tuning opportunities for the DBSCAN algorithm:
Below you can see the results of a series of runtime tests we ran to give an idea of how long DBSCAN's clustering operations take in different cases.
Runtime Speed Performances:
56 columns, eps = 0.5, min_samples = 5, n_jobs=1
DBSCAN (50K): 18.56 seconds
DBSCAN (100K): 57.06 seconds
DBSCAN (250K): 178.36 seconds
DBSCAN (500K): 431.72 seconds
DBSCAN (1M): 521.29 seconds
56 columns, eps = 0.5, min_samples = 5, n_jobs = -1
DBSCAN (50K): 3.47 seconds
DBSCAN (100K): 9.84 seconds
DBSCAN (250K): 28.55 seconds
DBSCAN (500K): 72.19 seconds
DBSCAN (1M): 81.58 seconds
Please note: tests were done on a generic computer (i7 8th Gen processor, 16GB RAM). Various factors can affect these results, especially at the low end of the scale (fast runtimes on small data).
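A hypothetical timing harness in the spirit of the tests above; the random 56-column data only stands in for the real datasets, so absolute times will differ from the figures reported.

```python
import time

import numpy as np
from sklearn.cluster import DBSCAN

def time_dbscan(n_samples, n_jobs):
    """Time one DBSCAN fit on random 56-column data (a stand-in dataset)."""
    rng = np.random.default_rng(0)
    X = rng.random((n_samples, 56))
    start = time.perf_counter()
    DBSCAN(eps=0.5, min_samples=5, n_jobs=n_jobs).fit(X)
    return time.perf_counter() - start

# Small sizes keep the demo quick; scale n up to reproduce the trend above.
for n in (2_000, 4_000):
    serial = time_dbscan(n, n_jobs=1)
    parallel = time_dbscan(n, n_jobs=-1)
    print(f"{n} samples: n_jobs=1 -> {serial:.2f}s, n_jobs=-1 -> {parallel:.2f}s")
```

Because the data is random, eps = 0.5 will mostly find noise here; the harness only measures runtime scaling, not clustering quality.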
Lots of opportunities for optimization
3- Data Size
DBSCAN's complexity and runtime performance should be tolerable in most cases for datasets with under 1 million samples. Considering that clustering algorithms are not very likely to be deployed and used in real time makes DBSCAN's performance even more tolerable: in most cases the user will want to create a number of clusters and work with the results, rather than run continuous inference.
Having said that, if the data at hand is big data with hundreds of millions or even billions of rows, DBSCAN will be quite difficult to work with. Some of the optimization opportunities, most notably the n_jobs parameter, can be handy in all cases, but they might not be enough when working with big data.
Additionally, the method discussed above can be used to dramatically improve DBSCAN's time complexity from O(N^2) to O(N). This implementation of DBSCAN is named ρ-approximate DBSCAN by the authors of the research paper DBSCAN Revisited.
Aside from the runtime discussion, DBSCAN is also an algorithm that is quite heavy on memory. Its space complexity can likewise reach O(N^2), and an average consumer computer is likely to struggle with DBSCAN's memory use past roughly 1 million sample points.
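A quick back-of-the-envelope calculation shows why: materializing a full float64 pairwise distance matrix, which is what the worst-case O(N^2) memory scenario corresponds to, grows impractically fast.

```python
def distance_matrix_gb(n_samples, bytes_per_value=8):
    """Memory (in GB) for a dense n x n float64 pairwise distance matrix."""
    return n_samples ** 2 * bytes_per_value / 1e9

print(distance_matrix_gb(100_000))    # 80.0  -> 80 GB for 100K samples
print(distance_matrix_gb(1_000_000))  # 8000.0 -> 8 TB for 1M samples
```

Scikit-Learn's DBSCAN avoids a full precomputed matrix by default, but the stored neighborhood lists can still grow toward quadratic size when eps is large, which is why memory becomes the bottleneck around the million-sample mark.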
DBSCAN continues to provide value as a unique clustering algorithm that can cluster arbitrary shapes with tolerable time complexity and runtime performance.
To read more about the advantages of DBSCAN algorithm you can see the article below:
Overall, we can conclude that DBSCAN has intermediate performance results. While it is not the slowest machine learning algorithm, its quadratic worst-case complexity means it can struggle when data gets too big. But since runtimes remain tolerable for datasets under a million samples, and there are ways to optimize DBSCAN through parallelization, hyperparameter tuning, and approximate variants, its performance will be adequate for most clustering applications.
We have done a few DBSCAN performance tests, shared above, to give a better idea of how the algorithm scales and what its runtime performance is like.
There are multiple ways to improve DBSCAN performance by tuning its hyperparameters. You can check out the article below: