
Decision Tree Example: Iris

Predicting with built-in Iris dataset

1- Decision Tree Classifier Model: Training & Prediction

a) Importing Python Libraries for Decision Tree Classification

We can use DecisionTreeClassifier from Scikit-Learn to build a decision tree model for classification. First we will need to import all the necessary Python libraries.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

b) Loading iris dataset

Next we can load the iris dataset and start working on it. Here is how the independent variables look, along with their names (feature names).

data = load_iris()

print(data.data[:5])
print(data.feature_names)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

And dependent values look like this including their names (target names or labels):

print(data.target[:10])
print(data.target_names)
[0 0 0 0 0 0 0 0 0 0]
['setosa' 'versicolor' 'virginica']

The first 10 target values are all setosa, which suggests the dataset is ordered by class rather than shuffled.
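Since the targets appear to be sorted by class, it is worth confirming the class balance before splitting. A quick check, assuming only the load_iris call above (np.bincount counts how many rows belong to each class):

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

# The iris targets are stored in order: 50 setosa, then 50 versicolor,
# then 50 virginica, across 150 rows with 4 features each.
print(np.bincount(data.target))  # [50 50 50]
print(data.data.shape)           # (150, 4)
```

Because of this ordering, a naive head/tail split could leave a whole class out of training; train_test_split shuffles the rows by default, which avoids the problem.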

c) Creating train & test data splits for training and testing the model

We will use the train_test_split function to easily create train and test partitions of the data. This could be done manually, but train_test_split makes it much easier. As output we will get four sets of data:

  • X_train: This part of the data is used to train the model. It includes the features (or independent variables) but not the outcomes (targets or labels).
  • X_test: Similar to X_train, usually around 20% or 30% of the rows of the whole dataset. It is used for testing the model after training. This is the part used for prediction, and its results are often named something like yhat, which stands for prediction outcomes (written as a y with a hat in mathematics).
  • y_train: This part is also used for training; it includes the target values for X_train. This is how the model learns, by observing X_train against y_train.
  • y_test: This part includes the target values for X_test. After a prediction is made on X_test, y_test is used for comparing and evaluating the model's outcomes.

These are just the variable names we are using; you could name them something else and it would work the same way.

X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)

test_size is a parameter that defines the train/test data ratio. Here 0.2 means 20% of the data is reserved for testing while 80% is used for training the model.
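As a sanity check on the ratio, we can inspect the shapes of the four outputs. A small sketch; random_state=42 is an extra parameter we add here only to make the split reproducible:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()

# 150 rows total: test_size=0.2 reserves 30 rows for testing, 120 for training
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

print(X_train.shape)  # (120, 4)
print(X_test.shape)   # (30, 4)
```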

d) Constructing a Decision Tree Classification Model

We can now construct a Decision Tree Classification model. criterion="entropy" makes the tree split based on information gain, and max_depth=None lets the tree grow until its leaves are pure.

DT = DecisionTreeClassifier(criterion="entropy", max_depth=None)

e) Training Decision Tree Model

Next we will train the model on the training data (X_train and y_train) which we split previously. This is the stage where the model learns.

DT.fit(X_train, y_train)

f) Predicting with our Decision Tree Classifier

After training, the model is ready for use. Here we make a prediction using the test partition of the data, without its target values.

We can then compare the model's output (yhat) with the target values we already have (y_test) to see how the model performs.

yhat = DT.predict(X_test)

print(yhat[0:5])
print(y_test[0:5])

2- Data Science Questions to Ask

At this point it’s important to have a clear evaluation of the model. Some questions that we can ask are:

  • Are the results accurate?
  • Is the model fast enough for our needs?
  • Is the model overfitting?
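The overfitting question in particular can be probed by comparing accuracy on the training data against accuracy on the test data: a large gap suggests the tree memorized the training set. A rough sketch, rebuilt end to end here so it runs on its own (the random_state value is our assumption, added for reproducibility):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

DT = DecisionTreeClassifier(criterion="entropy", max_depth=None)
DT.fit(X_train, y_train)

# An unpruned tree typically scores 1.0 on its own training data;
# a much lower test score would be a sign of overfitting.
print("Train accuracy:", DT.score(X_train, y_train))
print("Test accuracy: ", DT.score(X_test, y_test))
```

On a dataset as small and clean as iris the gap is usually tiny, but on noisier data this comparison is one of the quickest overfitting checks available.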

Decision Tree Dendrograms and Text Summary

3- Decision Tree Visualization

We can use the plot_tree function from the tree module in Scikit-learn to visualize Decision Tree models. These tree plots are called dendrograms in plotting terminology and can be very useful for extracting insights and seeing how the model makes its decisions.

The fact that we can observe the actual inner workings of decision tree models makes them a rare white box machine learning model. White box is a term coined as an antonym of black box models, which produce an output that can be interpreted while their inner workings cannot be observed in a meaningful way.

We will make two types of decision tree visualization using the same tree module from sklearn: 1- plot_tree function and 2- export_text function.

plot_tree
  • Can be used to create dendrograms using tree module.
export_text
  • Can be used to create an elaborate text summary of a decision tree model and all of its splits and rules.

We will use the feature names of our dataset in the dendrogram. Let’s see those feature names first.

print(data.feature_names)
print(data.target_names)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']

Seeing these names makes the data more intuitive to understand before working with it. Here we can see that the decision tree splits on features like sepal length, sepal width, petal length and petal width. These characteristics are used to predict a target value which classifies each sample into one of the three iris categories.

We can proceed with the dendrogram tree plot with following Python code:

a) Dendrogram (plot_tree)

from sklearn import tree
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(30,20))
_ = tree.plot_tree(DT, feature_names=data.feature_names,
                   class_names=data.target_names, filled=True)

b) Decision Tree Text Visualization (export_text)

Additionally, we can create a decision tree schema in text format. This can be achieved using the export_text function from sklearn’s tree module.

schema = tree.export_text(DT)
print(schema)

Decision Tree Classification visualized in text format using export_text function:

|--- petal width (cm) <= 0.80
| |--- class: 0
|--- petal width (cm) > 0.80
| |--- petal width (cm) <= 1.75
| | |--- petal length (cm) <= 4.95
| | | |--- sepal length (cm) <= 4.95
| | | | |--- class: 2
| | | |--- sepal length (cm) > 4.95
| | | | |--- class: 1
| | |--- petal length (cm) > 4.95
| | | |--- petal width (cm) <= 1.55
| | | | |--- class: 2
| | | |--- petal width (cm) > 1.55
| | | | |--- class: 1
| |--- petal width (cm) > 1.75
| | |--- petal length (cm) <= 4.85
| | | |--- sepal length (cm) <= 5.95
| | | | |--- class: 1
| | | |--- sepal length (cm) > 5.95
| | | | |--- class: 2
| | |--- petal length (cm) > 4.85
| | | |--- class: 2

You can see that if petal width is smaller than or equal to 0.80, the row is directly labeled ‘setosa’ (class 0). If petal width is greater than 0.80 but smaller than or equal to 1.75, petal and sepal lengths are checked next. This process continues until entropy is 0 at each leaf. Entropy 0 means there is no class disorder (all results of the split belong to the same class), so there is no information gain to be made by splitting further.
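To show just how readable these rules are, the schema above can be traced by hand as plain Python. This is only an illustration of the printed splits, not sklearn output, and the exact thresholds depend on the random train/test split, so your tree may differ:

```python
def classify_iris(sepal_length, petal_length, petal_width):
    """Hand-traced version of the export_text schema above.
    Returns 0 (setosa), 1 (versicolor) or 2 (virginica)."""
    if petal_width <= 0.80:
        return 0  # pure setosa leaf
    if petal_width <= 1.75:
        if petal_length <= 4.95:
            return 2 if sepal_length <= 4.95 else 1
        return 2 if petal_width <= 1.55 else 1
    if petal_length <= 4.85:
        return 1 if sepal_length <= 5.95 else 2
    return 2

# A typical setosa row (the first row of the dataset) hits the first branch
print(classify_iris(sepal_length=5.1, petal_length=1.4, petal_width=0.2))  # 0
```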

Statistical Evaluation of Classifier Prediction Results

4- Decision Tree Model Evaluation

We can evaluate the accuracy of decision trees with traditional statistical methods. Below you can see an example of the accuracy score applied to our decision tree’s prediction results on the iris dataset.

from sklearn import metrics
print("DT Accuracy is: ", metrics.accuracy_score(y_test, yhat))
DT Accuracy is: 0.9333333333333333

The accuracy score takes values between 0 and 1. 0.93 is a pretty good score, considering 1 is a perfect score with no errors whatsoever and that our model is a single tree rather than an ensemble.
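Beyond a single accuracy number, a confusion matrix shows which classes get mixed up with which. A sketch using the same metrics module, rebuilt end to end so it runs on its own (random_state=42 is our assumption, added for reproducibility):

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

DT = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)
yhat = DT.predict(X_test)

# Rows are true classes, columns are predictions;
# anything off the diagonal is a misclassification.
cm = metrics.confusion_matrix(y_test, yhat)
print(cm)
print(metrics.classification_report(y_test, yhat, target_names=data.target_names))
```

classification_report additionally breaks precision, recall and F1 down per class, which can reveal that an "accurate" model is still weak on one particular class.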

Note: maximum tree depth was not specified, meaning the tree splits as many times as needed to increase prediction accuracy. In the visualizations, however, max_depth was set to 4 for simpler demonstration purposes.

You can check out the tutorial we have prepared in case you’d like to learn the parameters used to optimize decision tree models:

Practice idea:

  • As practice, can you visualize the same data with a similar model and no maximum tree depth limitation?

Summary

In this Decision Tree tutorial we have created, trained and made predictions with a Decision Tree Classifier using Python and the Scikit-Learn library.

We have also practiced how to visualize and how to evaluate Decision Trees. Finally, we discussed some of the questions one can ask regarding the outcomes of a decision tree machine learning implementation.