Random Forest Complexity
1- Varying Complexity
To analyze Random Forest complexity, we must first look at decision trees, which have O(N log(N) · P · k) training complexity, where N is the number of samples, P the number of features, and k the depth of the tree.
Since Random Forest is an ensemble of multiple trees, the number of trees enters the equation as well, and it becomes n_estimators · O(N log(N) · P · k). This means Random Forest training complexity can land anywhere between quasilinear and cubic time, O(N^3), depending on many variables.
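The N log(N) term can be seen empirically. The sketch below (synthetic data, illustrative sizes only) times a single scikit-learn decision tree as the sample size N doubles; training time should grow slightly faster than linearly:

```python
# Timing sketch: train a single decision tree at growing sample sizes N.
# Data and sizes are synthetic/illustrative, not the article's benchmark set.
import time
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
for n in (5_000, 10_000, 20_000):
    X = rng.normal(size=(n, 20))                 # P = 20 features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)      # simple separable labels
    t0 = time.perf_counter()
    DecisionTreeClassifier().fit(X, y)
    print(f"N={n:>6}: {time.perf_counter() - t0:.3f}s")
```

Doubling N should roughly double the time (plus the log factor); adding more trees via a Random Forest multiplies the whole cost by n_estimators.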
2- Random Forest Speed Tests
We ran speed tests to provide runtime estimates for anyone interested in the performance of the Random Forest algorithm. These values can vary depending on the dataset, computational resources, and model parameters. Our Random Forest speed tests yielded the following results on a classification problem:
56 features, n_estimators = 100, n_jobs = 1
Random Forest (50K): 23.40 seconds
Random Forest (100K): 63.50 seconds
Random Forest (250K): 271.34 seconds
Random Forest (500K): 665.25 seconds
56 features, n_estimators = 100, n_jobs = -1
Random Forest (50K): 4.76 seconds
Random Forest (100K): 11.87 seconds
Random Forest (250K): 49.66 seconds
Random Forest (500K): 111.33 seconds
56 features, n_estimators = 10, n_jobs = -1
Random Forest (50K): 1.28 seconds
Random Forest (100K): 2.09 seconds
Random Forest (250K): 6.59 seconds
Random Forest (500K): 13.84 seconds
Random Forest (1M): 15.52 seconds
max_features = "log2", n_estimators = 10, n_jobs = -1
Random Forest (500K): 11.35 seconds
Random Forest (1M): 13.20 seconds
bootstrap = False, max_features = "log2", n_estimators = 10, n_jobs = -1
Random Forest (500K): 14.00 seconds
Random Forest (1M): 15.54 seconds
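For readers who want to reproduce this kind of measurement, here is a sketch of a timing harness in the same spirit. The data is synthetic and much smaller than the article's benchmark set (the real dataset is not published here); parameter names match scikit-learn's RandomForestClassifier:

```python
# Hypothetical timing harness over a few of the parameter sets above.
# Synthetic data, shrunk to 5,000 rows so it runs quickly; absolute
# times will not match the article's figures.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(5_000, 56))            # 56 features, as in the tests
y = (X.sum(axis=1) > 0).astype(int)

for params in ({"n_estimators": 100, "n_jobs": 1},
               {"n_estimators": 100, "n_jobs": -1},
               {"n_estimators": 10, "n_jobs": -1}):
    model = RandomForestClassifier(**params)
    t0 = time.perf_counter()
    model.fit(X, y)
    print(params, f"-> {time.perf_counter() - t0:.2f}s")
```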
Lots of opportunities for optimization
3- Data Size
Random Forests have somewhat sluggish training performance, especially when hyperparameters are not tuned. Even then, they scale relatively well in most cases, and if training time is not a deal breaker, their fast predictions make up for it.
With the right settings, Random Forest can handle millions of rows and hundreds of features without issues. If training performance is a major concern for the machine learning or data science project at hand, you can also check out our Naive Bayes article; that algorithm has an ultra-fast training phase as well as fast inference.
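To make that comparison concrete, the sketch below times GaussianNB against a small Random Forest on the same synthetic data. Naive Bayes trains in a single pass over the data, so its fit time is typically a small fraction of the forest's (sizes here are illustrative, not the article's benchmark):

```python
# Side-by-side fit-time sketch: Naive Bayes vs. a small Random Forest
# on synthetic data (illustrative only).
import time
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 20))
y = (X[:, 0] > 0).astype(int)

times = {}
for name, model in (("GaussianNB", GaussianNB()),
                    ("RandomForest", RandomForestClassifier(n_estimators=10,
                                                            n_jobs=-1))):
    t0 = time.perf_counter()
    model.fit(X, y)
    times[name] = time.perf_counter() - t0
    print(f"{name}: {times[name]:.3f}s")
```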
Note: the tests were run on consumer hardware (8th Gen Intel i7, 16GB RAM) and may not be very precise, especially at the very fast end. Data reading and loading times are excluded from the reported results.
You can also disable bootstrapping (creating random subsets of samples for tree training), but this is unlikely to yield large performance gains for most applications, and you may lose some accuracy.
The number of features considered at each split can also be limited, and it's great to have this flexibility. However, note that Scikit-Learn's RandomForestClassifier and RandomForestRegressor already use a sensible default for max_features (historically labeled "auto", meaning sqrt(n_features) for classification and all features for regression) while going through features during training.
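The two speed knobs just discussed translate directly into constructor arguments. A minimal sketch, with illustrative values (scikit-learn parameter names, not tuned for any particular dataset):

```python
# Sketch: the training-speed levers discussed above, as scikit-learn
# parameters. Values are illustrative, not recommendations.
from sklearn.ensemble import RandomForestClassifier

fast_rf = RandomForestClassifier(
    n_estimators=10,      # fewer trees -> roughly proportional speedup
    max_features="log2",  # consider fewer candidate features per split
    bootstrap=False,      # skip bootstrap sampling (may cost accuracy)
    n_jobs=-1,            # parallelize tree training across all cores
)
```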
Overall, we can conclude that decision trees have intermediate training performance. While they are not the slowest machine learning algorithms, they can struggle when data gets very large, especially in dimensionality. But since the results are still tolerable and there are ways to optimize the trees, their training performance will be adequate in most cases, even for big data applications.
Aside from training, decision trees have linear time complexity for inference (the prediction phase), which makes them very fast and resource-efficient at that stage. Decision trees can be suitable even for real-time machine learning deployments.
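The linear-time inference claim is easy to check: predicting a row is just a root-to-leaf walk in each tree, so total time scales with the number of rows. A quick sketch on synthetic data (sizes illustrative):

```python
# Sketch: time batch prediction after training; inference cost grows
# linearly with the number of rows predicted. Synthetic data only.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(50_000, 20))
y = (X[:, 0] > 0).astype(int)
rf = RandomForestClassifier(n_estimators=10, n_jobs=-1).fit(X, y)

t0 = time.perf_counter()
preds = rf.predict(X)
print(f"predicted {len(preds)} rows in {time.perf_counter() - t0:.3f}s")
```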
We have run a few decision tree performance tests to give a better idea of how decision tree algorithms scale and what their runtime performance looks like.
There are multiple ways to improve decision tree performance by tuning its hyperparameters. You can check out the article below: