
Random Forest Computational Complexity

1- Varying Complexity

To analyze Random Forest complexity, we must first look at decision trees, which have O(N·log(N)·P·k) training complexity, where N is the sample size, P the number of features, and k the depth of the tree.

Since Random Forest ensembles multiple trees, we also have to account for the number of trees, and the complexity becomes O(n_estimators·N·log(N)·P·k). Depending on these variables, Random Forest training complexity can land anywhere between quasilinear and roughly cubic time, O(N^3).
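As a rough worked example, here is a minimal sketch of how this formula scales with N. The tree depth of 20 and the unit cost are assumed values for illustration, so only the ratios between runs are meaningful:

```python
import math

def relative_cost(n_samples, n_features, depth, n_estimators):
    """Theoretical training cost O(n_estimators * N * log(N) * P * k), arbitrary units."""
    return n_estimators * n_samples * math.log2(n_samples) * n_features * depth

# Assumed depth of 20; only ratios to the 50K baseline are meaningful.
base = relative_cost(50_000, 56, 20, 100)
for n in (100_000, 250_000, 500_000):
    print(f"{n:>9,} rows: ~{relative_cost(n, 56, 20, 100) / base:.2f}x the 50K cost")
```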

Runtime Performance of Random Forest Training

2- Random Forest Speed Tests

We ran speed tests to provide runtime estimates for anyone interested in the performance of the Random Forest algorithm. These values can vary depending on the dataset, computational resources, and model parameters. Our Random Forest speed tests yielded the following results on a classification problem:

56 features, n_estimators = 100, n_jobs = 1
Random Forest (50K): 23.40 seconds
Random Forest (100K): 63.50 seconds
Random Forest (250K): 271.34 seconds
Random Forest (500K): 665.25 seconds

56 features, n_estimators = 100, n_jobs = -1
Random Forest (50K): 4.76 seconds
Random Forest (100K): 11.87 seconds
Random Forest (250K): 49.66 seconds
Random Forest (500K): 111.33 seconds

56 features, n_estimators = 10, n_jobs = -1
Random Forest (50K): 1.28 seconds
Random Forest (100K): 2.09 seconds
Random Forest (250K): 6.59 seconds
Random Forest (500K): 13.84 seconds
Random Forest (1M): 15.52 seconds

max_features = "log2", n_estimators = 10, n_jobs = -1

Random Forest (500K): 11.35 seconds
Random Forest (1M): 13.20 seconds

bootstrap = False, max_features = "log2", n_estimators = 10, n_jobs = -1

Random Forest (500K): 14.00 seconds
Random Forest (1M): 15.54 seconds
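
For anyone who wants to reproduce a comparable measurement, here is a minimal sketch of the timing setup, assuming scikit-learn and a synthetic dataset (make_classification) standing in for our data, which is not published:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the original data: 56 features, 500K rows.
X, y = make_classification(n_samples=500_000, n_features=56, random_state=0)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)

start = time.perf_counter()  # timer starts after data creation, mirroring
model.fit(X, y)              # how loading time was subtracted in our tests
print(f"Training took {time.perf_counter() - start:.2f} seconds")
```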

Lots of opportunities for optimization

3- Data Size

Random Forests can be somewhat sluggish to train, especially when hyperparameters are not optimized. Even then, they scale relatively well in most cases, and if training performance is not a deal breaker, fast prediction makes up for it.

With the right settings, Random Forest can handle millions of rows and hundreds of features without any issues. If training performance is a major concern for the Machine Learning or Data Science project at hand, you can also check out our Naive Bayes article; Naive Bayes has an ultra-fast training phase as well as fast inference.
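
As a rough comparison, a Naive Bayes model can be timed the same way. A minimal sketch, reusing X and y from the timing snippet above (GaussianNB is one of several Naive Bayes variants; the choice here is illustrative):

```python
import time
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()  # illustrative choice among the Naive Bayes variants
start = time.perf_counter()
nb.fit(X, y)       # X, y from the Random Forest timing sketch above
print(f"GaussianNB training took {time.perf_counter() - start:.2f} seconds")
```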

4- Conclusion

n_jobs

You can see how runtime benefits from parallelization, which we enabled by setting n_jobs to -1. You can read our article below for more on tuning Random Forest hyperparameters.

Note: the tests were run on consumer hardware (8th Gen Intel i7 processor, 16 GB RAM) and may not be very precise, especially at the very fast end. Data reading and loading times were subtracted from the reported results.

n_estimators

Another critical parameter that can be adjusted. The default value of 100, which means 100 trees will be trained, is usually overkill and can be lowered to improve training performance.

bootstrap

You can also disable bootstrapping (creating random subsets of samples for training each tree), but this is unlikely to yield huge performance gains for most applications, and you may end up losing accuracy.

max_features

The number of features considered at each split can also be limited, and it's great to have this flexibility. However, note that Scikit-Learn's Random Forest Classifier and Random Forest Regressor already use a reasonable "auto" default when selecting features during training.
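
Putting the four settings above together, here is a minimal sketch of a speed-oriented configuration; the exact values are illustrative rather than tuned:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative speed-oriented settings; tune against your accuracy budget.
fast_rf = RandomForestClassifier(
    n_estimators=10,      # far fewer trees than the default 100
    n_jobs=-1,            # use all available CPU cores
    bootstrap=False,      # skip per-tree sampling (may cost accuracy)
    max_features="log2",  # consider fewer candidate features per split
    random_state=0,
)
```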

Summary

Overall, we can conclude that decision trees deliver intermediate training performance. While they are not the slowest machine learning algorithms, they can struggle when data gets very large, especially in dimensionality. But since the results are still tolerable and there are ways to optimize trees, we can also conclude that their training performance will be adequate in most cases, even for big data applications.

Aside from training, decision trees have linear time complexity for inference (the prediction phase), which makes them very fast and resource-efficient at this stage. Decision trees can be suitable even for real-time machine learning deployment.
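
To see this in practice, here is a minimal sketch that times the prediction phase, reusing the fitted model and data from the timing snippet above:

```python
import time

# `model` and X come from the training timing sketch above.
start = time.perf_counter()
predictions = model.predict(X)
print(f"Predicting {len(X):,} rows took {time.perf_counter() - start:.2f} seconds")
```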

We have run a few decision tree performance tests to give a better idea of how decision tree algorithms scale and what their runtime performance looks like.

There are multiple ways to improve decision tree performance by tuning hyperparameters. You can check out the article below: