Decision Tree Tutorial
Decision Trees are building blocks of some of the most popular machine learning algorithms such as Random Forest and XGBoost. But this doesn’t mean they are useless on their own.
People like to understand what they are working with. So even if you mainly work with ensemble models such as random forests, the first step to understanding them is understanding decision trees.
In this machine learning tutorial we will elaborate on decision tree models and try to understand their intricacies.
a) Growing Decision Trees
Let’s say you are working on a decision tree classifier to identify whether an animal is a cat or not. By asking a sequence of sensible questions, we can form a decision tree that arrives at a decision.
Each question is a node in the tree, and the final answer that the questions lead to is a leaf node.
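The question-asking idea above can be sketched as a small hand-written tree. The features (has_whiskers, meows, weight_kg) and the 10 kg threshold are hypothetical and chosen only for illustration; a real decision tree learns its questions from the training data.

```python
# A hand-written decision tree for "cat or not".
# Feature names and the 10 kg threshold are made up for illustration.

def is_cat(animal: dict) -> bool:
    # Root node: first question
    if not animal["has_whiskers"]:
        return False  # leaf node: not a cat
    # Internal node: second question
    if not animal["meows"]:
        return False  # leaf node: not a cat
    # Internal node: third question
    if animal["weight_kg"] > 10:
        return False  # leaf node: not a cat
    return True  # leaf node: cat


house_cat = {"has_whiskers": True, "meows": True, "weight_kg": 4.2}
seal = {"has_whiskers": True, "meows": False, "weight_kg": 85.0}

print(is_cat(house_cat))  # True
print(is_cat(seal))       # False
```

Every path from the first question down to a return statement corresponds to one path from the root to a leaf node.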
b) Core Algorithms
Scikit-learn uses an optimized version of the CART (Classification and Regression Trees) learning algorithm in its built-in decision tree classifier and decision tree regressor objects.
Other popular learning algorithms for decision trees are CHAID and ID3.
c) Splitting Criteria
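Two criteria are commonly used to decide where a decision tree splits, and in practice they are largely interchangeable. These are:
- Gini impurity: Probability of misclassifying a randomly chosen sample from a node if it were labeled at random according to the class distribution in that node
- Entropy: Amount of disorder (uncertainty) in the class labels of a node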
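As a minimal sketch, both criteria can be computed from the class labels in a node using their standard textbook definitions. The function names and the toy labels below are illustrative, not taken from any library.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


pure = ["cat", "cat", "cat", "cat"]   # only one class in the node
mixed = ["cat", "cat", "dog", "dog"]  # two classes, evenly mixed

print(gini(pure), entropy(pure))    # both are zero for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0 for a 50/50 node
```

Both measures are zero for a pure node and reach their maximum when the classes are evenly mixed, which is why they tend to rank candidate splits similarly.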
Why Decision Tree Algorithm?
2- Decision Tree Benefits
When the dataset is limited in size, or when the learning process and outcomes need to be observed and interpreted, decision trees can still be a good alternative. Besides, some implementations can handle missing data natively, and a single tree trains much faster than its cousin, the random forest.
Very interpretable: We can directly observe the decision mechanism and how outcomes are obtained in a decision tree model.
Unique visualization opportunity: Decision trees can be drawn as tree diagrams, which contributes to their interpretability even further.
Building blocks: XGBoost, AdaBoost and Random Forest are popular tree-based ensemble models which can be extremely accurate out of the box. So, aside from being useful on their own, decision trees are also worth understanding in order to understand these ensembles.
Easy evaluation: Decision trees can be evaluated easily with a standard metric such as accuracy score.
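The interpretability and evaluation points can be illustrated with a short scikit-learn sketch. The Iris dataset, the 70/30 split and max_depth=2 are assumptions made only to keep the printed tree small; they are not settings from this tutorial.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0
)

# A shallow tree keeps the printed rules short and readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X_train, y_train)

# Text rendering of the learned questions and leaf decisions
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)

# Share of test samples that the tree classifies correctly
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

For a graphical diagram instead of text, sklearn.tree.plot_tree draws the same tree with matplotlib.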
Decision Tree Pros
In addition to the benefits above, we can list some more pros that make decision trees a favorable machine learning algorithm.
- They require little data preprocessing
- No normalization needed
- No scaling needed
- Some implementations handle missing data natively
Decision Tree Cons
Despite their usefulness, there are also some disadvantages of using decision tree algorithms which you might want to include in your model selection consideration. Some of these are:
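- Prone to overfitting, especially when the tree is grown deep
- Limited accuracy compared to ensemble models
- Training can become expensive on large datasets with many features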
3- Key Industries
Decision Trees are used heavily across many domains through tree-based algorithms such as AdaBoost, XGBoost and Random Forests.
Decision Trees are also used in the industries below when interpreting the model’s working provides additional value:
- Data Science
- Tourism Trends (Forecasting Airplane or Hotel occupancy)
- Business growth outcomes and revenue scenarios
Who Found Decision Tree Algorithm?
4- Decision Tree History
Gordon V. Kass developed the CHAID (Chi-square Automatic Interaction Detection) decision tree model in 1980, based on the earlier Automatic Interaction Detection technique. Part of the original paper can be found here.
Decision trees with the CART and ID3 learning algorithms followed in the 1980s.
How Fast are Decision Trees?
5- Computational Complexity
Training a decision tree has roughly O(Nkd) complexity, where N is the number of rows, k the number of features and d the depth of the tree.
Our tests with decision trees on a simple classification problem yielded the runtimes below. (Apart from max_depth, all parameters were left at their defaults.)
56 features, max_depth=16
Decision Tree (500K): 53.95 seconds
Decision Tree (1M): 63.30 seconds
2 features, max_depth=4
Decision Tree (500K): 1.15 seconds
Decision Tree (1M): 1.25 seconds
You can see a complete version of our runtime speed test results regarding Decision Trees in this post:
How to Use Decision Trees?
6- Scikit-learn Decision Tree Implementations
You can easily and conveniently create decision tree models using the scikit-learn library in Python, then train them and use them for prediction.
Decision Trees can be used for Classification as well as Regression predictions.
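A minimal sketch of both use cases follows, using scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor. The toy data is made up for illustration, and max_depth=2 for the regressor is an arbitrary choice.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data: one numeric feature per row (made up for illustration)
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y_class = [0, 0, 0, 1, 1, 1]              # categorical target
y_value = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]  # numeric target

# Both models share the same create / fit / predict workflow
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y_value)

new_rows = [[1.4], [5.6]]
class_pred = clf.predict(new_rows)  # predicted class labels
value_pred = reg.predict(new_rows)  # predicted numeric values

print(class_pred)
print(value_pred)
```

The classifier returns one of the class labels seen in training, while the regressor returns the mean target value of the training rows in the leaf that each new row falls into.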
You can check out these tutorials to discover the relevant built-in Scikit-learn decision models:
Decision trees will also sometimes produce better predictions with the right parameter settings. You can see more about Decision Tree Optimization in the next section.
How can I improve Decision Trees?
7- Decision Tree Optimization
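Decision trees find a feature and a split point in that feature to split the data at a node into groups that separate the classes as cleanly as possible.
How a split is chosen depends on the learning algorithm used to build the tree. Learning algorithms can use one of the following techniques:
- Statistical techniques (for example the chi-square test in CHAID)
- Information theory (for example entropy and information gain in ID3)
- Purity comparison (for example Gini impurity in CART)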
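The search for a split point can be sketched for a single numeric feature as follows. This is a simplified illustration that scores each candidate threshold by weighted Gini impurity; it is not the implementation of any particular library, and the ages and smoker labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold, best_score = None, float("inf")
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # identical values cannot be separated by a threshold
        threshold = (low + high) / 2  # midpoint between neighboring values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # Impurity of the two child nodes, weighted by their sizes
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


ages = [22, 25, 31, 47, 52, 60]
smoker = ["no", "no", "no", "yes", "yes", "yes"]

threshold, score = best_split(ages, smoker)
print(threshold, score)  # 39.0 0.0
```

A full tree learner repeats this search over every feature at every node and keeps the feature and threshold with the best score.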
When to stop splitting?
We have to tell the decision tree when it can stop splitting nodes.
Decision trees are built by splitting data into nodes, and each node holds information about the categories present in its share of the data.
For example, a training set based on patient data might consist of variables such as age, sex, diet, smoker, income level, chronic disease etc. The decision tree algorithm splits the data on these variables and constructs the tree structure that provides the most information for prediction.
Like other supervised machine learning algorithms, a decision tree model can be used for prediction after it has been trained with training data. Tuning the decision tree parameters can sometimes improve the results considerably.
A pure node is a homogeneous node: all of its samples belong to the same class.
Entropy in decision trees is the uncertainty of the data, so if a node has high impurity it will also have high entropy.
The decision tree algorithm recursively tries to reach low impurity and low entropy. It does this by checking the information gain of each candidate split of a node.
Information gain can be calculated as: Entropy before split – weighted entropy after split.
This means that the more a split lowers entropy, the higher its information gain, and the decision tree will favor the splits with the highest information gain.
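The information gain calculation can be worked through on a toy node. The labels and the two candidate splits below are made up for illustration, and entropy is computed in bits (log base 2).

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(parent)
    weighted_after = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_after


parent = ["yes"] * 5 + ["no"] * 5  # evenly mixed node, entropy of 1 bit

# Candidate split A separates the classes perfectly
split_a = [["yes"] * 5, ["no"] * 5]
# Candidate split B leaves both child nodes almost as mixed as the parent
split_b = [["yes", "yes", "yes", "no", "no"], ["yes", "yes", "no", "no", "no"]]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(round(gain_a, 3), round(gain_b, 3))  # 1.0 0.029
```

Split A removes all of the uncertainty and split B removes almost none, so a tree that maximizes information gain would choose split A.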
Some of the commonly tweaked decision tree parameters are:
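criterion: “gini” by default in scikit-learn’s decision tree classifier; “entropy” is the information-based alternative
max_depth: maximum depth of the tree; None by default, which lets the tree grow until its leaves are pure or too small to split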
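One way to explore these two parameters is a small cross-validated grid search. The sketch below uses the Iris dataset and a 5-fold split as placeholder choices; the candidate values in the grid are assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for the two parameters discussed above
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 2, 3, 4, None],
}

# Fit one tree per parameter combination and cross-validation fold
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

Limiting max_depth is also one of the simplest ways to tell the tree when to stop splitting, which helps against overfitting.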
You can check this article for more detailed Decision Tree Optimization: