Random Forest Tutorial
1- Overview
Random Forest is an extremely popular algorithm, and there are good reasons for that. It's very easy to become a fan of the Random Forest algorithm and its applications, and in this Random Forest tutorial series we will explain why.
Random Forest works very well out of the box, requiring little adjustment, while still offering many useful optimization parameters. It is relatively fast, especially at inference; training performance can be improved through hyperparameter tuning; and its flexible computational complexity makes it possible to scale when needed.
Where Random Forests really shine, though, is how well they work with almost all kinds of data: missing values, bias, noise, unscaled features and even non-linearity.
Ensembling, Randomness and Decision Trees
The Random Forest algorithm stands on the shoulders of a giant: the Decision Tree algorithm. Although decision tree algorithms had already been used successfully for a very long time, Random Forest really raised the bar for tree implementations as well as ensemble techniques.
Random Forests are Ensembles based on Decision Trees. To understand Random Forest, we need to first understand two things very well: 1) Ensembles 2) Decision Trees
What is an Ensemble?
Ensemble is a French word that simply means together. In English its usage is mostly technical, but it carries much the same meaning.
In Machine Learning, ensemble algorithms are a parent category for techniques that combine multiple machine learning models, making them more useful by gathering them together and drawing on their combined predictions.
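To make the idea concrete, here is a minimal sketch of an ensemble, assuming scikit-learn and a synthetic dataset purely for illustration: three different models are trained and their predictions combined by majority vote.

```python
# A minimal sketch of the ensemble idea: combine several models' predictions
# by majority (hard) voting. The dataset is synthetic, purely for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three different base models, combined by majority vote
ensemble = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(random_state=42)),
    ("knn", KNeighborsClassifier()),
    ("logreg", LogisticRegression(max_iter=1000)),
], voting="hard")

ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))
```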
What is a Decision Tree? (from Random Forest Perspective)
A Decision Tree is a machine learning algorithm whose mechanism is similar to a growing tree. However, this resemblance can make things trickier: machine learning trees are upside down. The root of the tree sits at the top, and the tree grows downward, splitting through branches into further nodes. A node that splits further is called an internal node, and a final node that splits no further is called a leaf.
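As a small illustration (assuming scikit-learn and its built-in Iris dataset), you can fit a shallow decision tree and print its structure, with the root at the top and the leaves at the bottom:

```python
# A small sketch of the upside-down tree structure: fit a shallow decision
# tree on the Iris dataset and print it as text.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X, y)

# The first split is the root; indented lines are branches from internal
# nodes; "class:" lines are the leaves.
print(export_text(tree, feature_names=load_iris().feature_names))
```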
So, why are decision trees important for understanding random forests? Because a random forest is nothing but multiple randomized decision trees glued together for increased accuracy and predictive power.
This page focuses on Random Forest implementation and characteristics; if you are interested in working with Decision Trees, you can see this tutorial:
Why Use the Random Forest Algorithm?
2- Random Forest Benefits
- Increased Accuracy
- Control over trees and ensembling through parameters
- Works with non-linear data very well.
- Increased application area through Random Forest Regressor and Random Forest Classifier
- Simple to nonexistent pre-processing requirements
- Can handle missing data
- Doesn't require normalization (see the sketch after this list)
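As a quick illustration of the "no normalization" point, here is a hedged sketch (with synthetic data purely for demonstration) showing that rescaling features by wildly different constants leaves a Random Forest essentially unaffected:

```python
# Illustration: Random Forest is insensitive to feature scales.
# Synthetic data, purely for demonstration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_rescaled = X * np.array([1.0, 1e3, 1e-3, 1e6, 1e-6])  # distort feature scales

rf = RandomForestClassifier(n_estimators=100, random_state=0)
print("raw:     ", cross_val_score(rf, X, y).mean())
print("rescaled:", cross_val_score(rf, X_rescaled, y).mean())
# Both scores come out essentially identical: tree splits depend only on
# the ordering of feature values, which positive rescaling preserves.
```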
Random Forest Pros
- More Robust
- More Accurate
- Handles missing values, bias and noise; no need for feature scaling
Random Forest Cons
- Less interpretable
- High training time complexity
- Might overfit with wrong parameter settings
Random Forest Application Areas
3- Key Industries
Random Forest is a very popular algorithm and it finds use cases in many domains, solving a variety of problems.
Decision Trees are used heavily in all domains through tree-based algorithms such as AdaBoost, XGBoost and Random Forests.
Random Forests are commonly used in many industries and disciplines. Some examples include, but are not limited to, the following:
- Data Science
- Banking Sector (Retail and Commercial)
- Science
- Medicine
- Business Analysis
- Internet Technologies including e-commerce
- Financial Market Predictions (including stock markets, energy commodities markets, derivative trading etc.)
Who Invented the Random Forest Algorithm?
4- Random Forest History
Leo Breiman co-invented the Random Forest machine learning algorithm with Adele Cutler. It is based on Decision Trees, an approach Breiman also helped pioneer as a co-author of the original CART work.
You can access the original Random Forest paper on this page:
Are Random Forests Fast?
5- Computational Complexity
Random Forests can have a variety of time complexities, which makes proper tuning all the more important. We have carried out a few runtime tests to provide guidance on Random Forest performance under different conditions, and the partial view of the results below shows how much they can vary:
- 56 features, n_estimators = 100, n_jobs = 1
  - Random Forest (500K): 665.25 seconds
  - Random Forest (1M): 776.90 seconds
- max_features = "log2", n_estimators = 10, n_jobs = -1
  - Random Forest (500K): 11.35 seconds
  - Random Forest (1M): 13.20 seconds
You can check out the complete version of the Random Forest runtime speed tests here:
Random Forests can be somewhat sluggish to train, which can be addressed via tuning, and they are adequately fast at inference, which contributes to their widespread adoption.
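If you want to reproduce this kind of measurement yourself, here is a rough sketch; the dataset below is synthetic and smaller than in our tests, so sizes and parameters are placeholders rather than the exact setup above:

```python
# A rough sketch for timing Random Forest training under different settings.
# Synthetic data; dataset size and parameters are placeholders.
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100_000, n_features=56, random_state=0)

for params in ({"n_estimators": 100, "n_jobs": 1},
               {"n_estimators": 10, "max_features": "log2", "n_jobs": -1}):
    rf = RandomForestClassifier(random_state=0, **params)
    start = time.perf_counter()
    rf.fit(X, y)
    print(params, f"{time.perf_counter() - start:.2f} seconds")
```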
How to Use Random Forests?
6- Scikit-Learn Random Forest Implementation
To create Random Forest models you can use Scikit-learn, a machine learning library for Python that makes all kinds of machine learning tasks a breeze.
Random Forests are a more robust and versatile alternative to many machine learning algorithms, including Decision Trees, and you can use them for classification as well as regression.
We have these simplified Python tutorials for Random Forests which show how to create and train Random Forest models and use them for prediction:
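In the meantime, here is a minimal classification sketch, assuming scikit-learn and using the built-in Iris dataset as a stand-in for your own data:

```python
# A minimal sketch: create, train and use a Random Forest classifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print("Test accuracy:", rf.score(X_test, y_test))
print("Predictions:", rf.predict(X_test[:5]))
```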
Random Forest models can be used in many domains and applied to solve many kinds of problems or to gain insight from different datasets. Although they work great out of the box, you will likely need to make some adjustments to your model at some point, which brings us to Random Forest tuning.
How to optimize Random Forests?
7- Random Forest Tuning
Random Forests come with great tuning opportunities that can make the model even more robust, performant and sometimes more efficient. Knowing how to optimize a Random Forest model can make the difference between a successful implementation and a failed one, or, if you're looking for a machine learning job, between a successful interview and fruitless efforts.
When tuning Random Forest models, we have the parameters one would be familiar with from Decision Trees, and then some. Some of the Random Forest specific hyperparameters are:
- n_estimators
- bootstrap
- warm_start
- max_samples
- n_jobs
- oob_score
The names might look alien if you are new to the machine learning field, but they are actually fun and straightforward parameters to work with.
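Here is a sketch of tuning a few of the hyperparameters listed above with scikit-learn's RandomizedSearchCV; the search space is illustrative, not a recommendation:

```python
# A sketch of randomized hyperparameter search over a Random Forest.
# The grid values are illustrative placeholders. Synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_features": ["sqrt", "log2", None],
    "max_samples": [None, 0.5, 0.8],  # fraction of rows each tree sees (with bootstrap=True)
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_distributions, n_iter=10, cv=3, random_state=0,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
```

Setting oob_score=True offers a cheaper alternative for a quick quality estimate: each tree is scored on the samples its bootstrap left out, with no separate validation split needed.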
If you are interested in tuning Random Forests, you can see this post we prepared regarding Random Forest hyperparameters and how to tune them:
Is there a random forest implementation example?
8- Random Forest Examples
We put together a simple Random Forest implementation example where you can see the random forest machine learning algorithm at work in a few lines of Python.
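Here is one more minimal end-to-end sketch, this time for regression with RandomForestRegressor, using scikit-learn's California housing dataset as a stand-in for a real problem (it downloads on first use):

```python
# A minimal end-to-end regression sketch with RandomForestRegressor.
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print("R^2 on test set:", rf.score(X_test, y_test))
```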