
Random Forest Example: Iris

Predicting with the built-in Iris dataset

1- Random Forest Classifier Model: Training & Prediction

a) Python Libraries for RandomForestClassifier

We can use RandomForestClassifier from Scikit-Learn to build a Random Forest model for classification. 

Let’s import the libraries we’re going to use first.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

b) Built-in Iris Dataset

Then we can load the Iris dataset with Scikit-Learn's load_iris function. Let’s explore the data a little bit.

data = load_iris()

print(data.data[:5])
print(data.feature_names)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Class names appear as follows:

print(data.target[:10])
print(data.target_names)
[0 0 0 0 0 0 0 0 0 0]
['setosa' 'versicolor' 'virginica']
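If you prefer a tabular view for this kind of exploration, the arrays can be wrapped in a pandas DataFrame. This is a minimal optional sketch (pandas isn’t otherwise used in this tutorial):

import pandas as pd

# Combine features and class labels into one table for quick inspection
df_explore = pd.DataFrame(data.data, columns=data.feature_names)
df_explore["species"] = data.target_names[data.target]
print(df_explore.head())
print(df_explore["species"].value_counts())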
c) Visualizing the Dataset with a Seaborn Pairplot

Let’s explore the dataset a little bit through visualization. The Seaborn library has a great feature for datasets like this called pairplot. We can use it to see correlations and overlap between the different features and the target.

import seaborn as sns
sns.set_theme(style="darkgrid")

df = sns.load_dataset("iris")
sns.pairplot(df, hue="species", palette="icefire")
Iris Dataset Scatter Matrix with Seaborn

According to the pairplot, the setosa class is clearly separable on every feature, while versicolor and virginica overlap considerably. This makes the dataset a good fit for a machine learning classifier.

d) Creating Train Test Split via train_test_split

This part is the same for most machine learning tasks. We need to create the X_train, X_test, y_train and y_test partitions from the dataset.

  • X_train and y_train are used to train the machine learning model when we apply the .fit() method.
  • X_test is used for predicting (with the .predict() method), and y_test is used to evaluate the prediction results.
  • We can also define the ratio for partitioning the data. Here we are reserving 20% of the data for testing via the test_size parameter.
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)

Now all the data we need is properly divided and ready to use for training, prediction and model evaluation.
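As a side note: train_test_split shuffles the data randomly on every run. If you need a reproducible, class-balanced split, you can pass the optional random_state and stratify parameters; a minimal sketch:

# random_state fixes the shuffle seed, stratify preserves class ratios
# in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target)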

e) Building Random Forest Classifier

Now we can create the Random Forest model itself, using the RandomForestClassifier class we’ve already imported.

This is also a good stage for defining some of the parameters of the Random Forest model. Some of the most commonly adjusted parameters are: n_estimators (the number of trees), n_jobs (for parallel computing), max_depth (the maximum tree depth) and max_features (the number of features considered at each split). Random Forest parameters and how to optimize them are covered in a separate tutorial.

Let’s create a plain vanilla Random Forest model with all the default settings:

RF = RandomForestClassifier()

You could also create a Random Forest model with a few parameters defined explicitly, as follows:

RF = RandomForestClassifier(n_estimators=10, n_jobs=-1)
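If you’d rather search these parameters systematically than pick them by hand, cross-validated grid search is a common approach. The parameter grid below is only an illustrative assumption, not a recommendation:

from sklearn.model_selection import GridSearchCV

# Try a few candidate values per parameter with 5-fold cross validation
param_grid = {"n_estimators": [10, 50, 100],
              "max_depth": [None, 3, 5],
              "max_features": ["sqrt", None]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)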

f) Training Random Forest

We have the model and we have the training data, so now we can start training the model.

RF.fit(X_train, y_train)

Runtime for this process can vary depending on the dataset. For a clearer idea of how long it takes to train the Random Forest algorithm and how Random Forest complexity scales, see our dedicated article on the topic.
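To get a rough feel for training time on your own machine, you can simply wrap the fit call in a timer. A minimal sketch using only the standard library:

import time

start = time.perf_counter()
RF.fit(X_train, y_train)  # refitting here purely to measure training time
print(f"Training took {time.perf_counter() - start:.3f} seconds")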

g) Predicting with Random Forest

After training, the model is ready for use. Here we are making a prediction using the test partition of the data, without its target values.

We can then compare the model’s output (yhat) with the target values we already have (y_test) to see how the model is performing.

yhat = RF.predict(X_test)

print(yhat[0:5])
print(y_test[0:5])
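Besides hard class labels, Scikit-Learn classifiers also expose predict_proba; for a Random Forest this reports the fraction of trees voting for each class, which is useful when you care about prediction confidence:

# Each row holds the predicted probability of setosa, versicolor, virginica
proba = RF.predict_proba(X_test)
print(proba[0:5])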

2- Model Expectations and Questions to Ask

After making predictions, it’s common in machine learning to evaluate the results and iterate on the different ML or Data Science processes.

It’s also good practice to align expectations before trying to improve the model. Expectations will be based on the project and should be considered carefully.

For example, these questions can be beneficial before, during and after the modeling process:

  1. Who will use the model?
  2. What are the implications of false negatives or false positives?
  3. What level of accuracy is tolerable?
  4. Or, equivalently, what level of error rate is tolerable?
  5. Could a wrong implementation cause major damage or even hurt someone? (medical applications, self-driving cars, investment management, energy trading, power grid optimization etc.)
  6. How can the model be improved?
  7. Is the training data sufficient?
  8. What visualization methods can help communicate the model’s results?
  9. Will data be collected continuously? If yes, where and how will it be stored? (SQL databases such as PostgreSQL, SQLite, MySQL, SQL Server, cloud SQL solutions, NoSQL stores etc.)
  10. Will the model be deployed in real time or make real-time predictions?
These are some of the questions that can be asked. The list is by no means exhaustive, but it gives a rough idea of project-specific Machine Learning considerations.

Plotting Individual Trees in a Random Forest

3- Random Forest Visualization

We can also extract individual trees from a Random Forest and visualize those individually.

a) Random Forest Dendrogram

It’s possible to plot dendrograms of individual trees in a random forest. This can offer deeper insight into the depth of the trees and the information gain at each split. Of course, in a random forest the tree learning results are averaged, but this doesn’t mean individual trees offer no insight into the random forest model.

Here is the Python code for extracting an individual tree (estimator) from Random Forest:

ind_tree = RF.estimators_[4]
print(ind_tree)
DecisionTreeClassifier(max_features='auto', random_state=792013477)

Here we are printing the 5th tree (index 4). 
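The estimators_ list holds every fitted tree, so any valid index works. You can check how many trees your forest contains (100 by default in recent Scikit-Learn versions, or 10 if you used the n_estimators=10 variant above):

print(len(RF.estimators_))  # number of trees in the trained forest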

We can create a dendrogram (or tree plot) similar to what we did for Decision Trees.

from sklearn import tree
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(30,20))
_ = tree.plot_tree(ind_tree, feature_names=data.feature_names, filled=True)
Tree plot of single tree in a random forest model

We can also visualize multiple trees at once using a for loop to create a collage. These dendrograms will be harder to interpret and potentially less useful; however, we can still learn a lot about the random forest model from them. Let’s create a 7-column plot with each individual tree in the random forest taking up one column.

You can also increase the dpi parameter below to your liking, in case you’re going to save the plot and zoom into the image to investigate each tree more closely. Increasing the dpi will take more computation resources and storage space.

fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(15,3), dpi=1800)

for i in range(7):
    tree.plot_tree(RF.estimators_[i],
                   feature_names=data.feature_names,
                   class_names=data.target_names,
                   filled=True,
                   ax=axes[i])
Visualizing individual trees in a random forest model
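If you plan to zoom into the high-dpi collage later, you can save the figure to disk; the filename here is just an example:

# Save the collage at the dpi defined on the figure above
fig.savefig("rf_trees.png", bbox_inches="tight")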

b) Confusion Matrix

In recent Scikit-Learn versions (1.2+), plot_confusion_matrix has been removed; ConfusionMatrixDisplay.from_estimator is the equivalent way to plot a confusion matrix straight from the fitted model:

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(RF, X_test, y_test,
                                      display_labels=data.target_names,
                                      cmap="bone")
Random Forest Confusion Matrix

We can see that one sample is confused: a virginica is labeled as a versicolor iris. The other 29 samples are classified correctly. Keep in mind we did a train test split with 20% reserved for testing, so these 30 samples represent that test partition.
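For a per-class breakdown of this result (precision, recall and F1 score instead of raw counts), Scikit-Learn’s classification_report is a convenient complement to the confusion matrix:

from sklearn.metrics import classification_report

print(classification_report(y_test, yhat, target_names=data.target_names))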

Investigating machine learning prediction results

4- Random Forest Model Evaluation

We can evaluate the accuracy of a random forest with traditional statistical methods. Below you can see the accuracy score applied to our random forest's prediction results on the Iris dataset.

from sklearn import metrics
print("RF Accuracy is: ", metrics.accuracy_score(y_test, yhat))
RF Accuracy is: 0.9
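Because a single 80/20 split of 150 samples is small, accuracy can swing noticeably between runs. Cross-validation gives a steadier estimate; a minimal sketch:

from sklearn.model_selection import cross_val_score

# 5-fold cross validation on the full dataset
scores = cross_val_score(RandomForestClassifier(), data.data, data.target, cv=5)
print(scores.mean(), scores.std())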

You may want to tune your Random Forest model based on the needs of your project. Some of the criteria for tuning based on project-specific expectations are listed below, with a small timing sketch after the list:

  1. Accuracy
  2. Performance
  3. Time for training
  4. Simplicity
  5. Computational efficiency
  6. Need for continuing training
  7. Data specific needs (addressing bias, noise, non-linearity) etc.
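As a concrete illustration of the accuracy versus training time tradeoff in this list, you could sweep n_estimators and record both. This sketch is an example setup, not a benchmark:

import time
from sklearn.metrics import accuracy_score

# Larger forests usually cost more fit time for diminishing accuracy gains
for n in [1, 10, 100, 500]:
    model = RandomForestClassifier(n_estimators=n)
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"n_estimators={n}: accuracy={acc:.3f}, fit time={elapsed:.3f}s")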

Summary

In this part of our Random Forest Tutorial Series we covered a simple classification implementation using the built-in RandomForestClassifier model in Scikit-Learn.

We created a Random Forest model using RandomForestClassifier from the sklearn.ensemble module, trained it, used it for prediction on a built-in dataset and evaluated the accuracy of its results. You can also check out RandomForestRegressor for predicting continuous values with Random Forest machine learning models.
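For completeness, here is a minimal regression counterpart using RandomForestRegressor; the built-in diabetes dataset is just an example choice:

from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(X, y, test_size=0.2)

reg = RandomForestRegressor()
reg.fit(Xr_train, yr_train)
print(reg.predict(Xr_test[:5]))  # continuous-valued predictions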