kNN Example: Iris
Iris & kNN
1- kNN Classifier Model: Training & Prediction
a) Python Libraries for KNeighborsClassifier
We can build a kNN classifier model using scikit-learn’s sklearn.neighbors module.
Let’s first import the libraries we’re going to use.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
b) Iris Dataset
The Iris dataset is very handy for machine learning demonstrations, so let’s use it for kNN as well. Here is a look at the independent variables (features) of iris:
data = load_iris()
print(data.feature_names)
print(data.data[:5])
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
The target values of iris are shown below:
print(data.target[:10])
print(data.target_names)
[0 0 0 0 0 0 0 0 0 0]
['setosa' 'versicolor' 'virginica']
We can also visualize the feature-class relations of iris, as well as the correlations between features, with a seaborn pairplot:
import seaborn as sns
sns.set_theme(style="darkgrid")
df = sns.load_dataset("iris")
sns.pairplot(df, hue="species", palette="icefire")
c) Creating Train Test Split via train_test_split
As the next step, we can create our training data and start using the model.
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2)
This part is the same for most machine learning tasks. We need to create the X_train, X_test, y_train and y_test parts from the dataset.
- X_train and y_train are used to train the machine learning model when we apply the .fit() method.
- X_test is used for predicting (with the .predict() method) and y_test is used to evaluate the results of the prediction.
- We can also define the ratio for partitioning the data. Here we are reserving 20% of the data for testing using the test_size parameter.
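The split described above can be sketched as follows. Note that random_state=42 and stratify are assumptions added here for reproducibility and balanced classes; they are not part of the original call.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()

# random_state=42 and stratify are illustrative assumptions, not from the original call
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

# 150 samples with test_size=0.2 gives 120 training rows and 30 test rows
print(X_train.shape, X_test.shape)
# Number of test samples per class (balanced because of stratify)
print(np.bincount(y_test))
```

Without stratify the class counts in the test partition can vary from run to run, which may matter on a dataset this small.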
d) Building KNeighborsClassifier
We need to create a kNN classifier and then train it before we can use it for prediction tasks. When creating a kNN model, it’s possible to simply go with the default values, which usually yield satisfactory results initially. It’s also possible to create a model with specific parameter values to make it more suitable for the needs and expectations of the machine learning or data science project.
kNN optimization is covered in more depth in a separate, more comprehensive tutorial. A model with the default values can be created as follows:
KNN = KNeighborsClassifier()
You can also create a more custom kNN model by assigning/optimizing some of the hyperparameters of the model. Here is an example:
KNN = KNeighborsClassifier(n_neighbors=10, weights="distance")
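To see what actually differs between the default and the custom model, one option is to inspect the hyperparameters with get_params(). This is only a small sketch; the variable names default_knn and custom_knn are made up for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical names, used only for this comparison
default_knn = KNeighborsClassifier()
custom_knn = KNeighborsClassifier(n_neighbors=10, weights="distance")

for name, model in [("default", default_knn), ("custom", custom_knn)]:
    params = model.get_params()
    # Show the two hyperparameters that the custom model overrides
    print(name, params["n_neighbors"], params["weights"])
```

With weights="distance", closer neighbors get a larger vote than distant ones, whereas the default "uniform" setting counts every neighbor equally.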
2- Training kNN Classifier
We have the model and we have the training data, so now we can start training the model.
Depending on the dataset, the runtime of this process can change. How long it takes to train the kNN algorithm and how kNN complexity scales are covered in a separate article.
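The training step itself is a single call to .fit(). Here is a minimal self-contained sketch, assuming the default model and a fixed random_state (an assumption added only so the run is repeatable):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
# random_state=42 is an illustrative assumption for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

KNN = KNeighborsClassifier()
KNN.fit(X_train, y_train)

# After fitting, the model exposes the class labels and the number of stored samples
print(KNN.classes_)
print(KNN.n_samples_fit_)
```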
3- Predicting with kNN
After training, the model is ready for use. Here we are making a prediction using the test partition of the data without the target values.
We can then compare the model’s output (yhat) with the target values that we already have (y_test) and see how the model is performing.
yhat = KNN.predict(X_test)
print(yhat[0:5])
print(y_test[0:5])
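Beyond eyeballing the first five values, the comparison can be done over the whole test partition. The sketch below assumes a default model and a fixed random_state; the variable name matches is made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
# random_state=42 is an illustrative assumption for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
KNN = KNeighborsClassifier().fit(X_train, y_train)
yhat = KNN.predict(X_test)

# Boolean array: True where the prediction equals the known target
matches = yhat == y_test
print("Correct predictions:", matches.sum(), "out of", len(y_test))
print("Fraction correct:", matches.mean())

# Translate the first few numeric predictions into species names
print(data.target_names[yhat[:5]])
```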
4- Machine Learning Project Check
This is a good stage to take a look at the model performance and project expectations. You can check whether the model is producing the expected results and whether its performance and efficiency are in line with the project. Below we will share “Model Evaluation” and “Model Visualization”, two topics that can be very useful when working with Machine Learning models.
Aside from prediction accuracy and computational efficiency, there can be other topics to keep considering, such as the ethics of the dataset and its acquisition, the ethics of the model and its output, and the biases of the model.
When we are working with great tools we should match that greatness with a great sense of responsibility, ethics and compassion.
Additionally, the training dataset might not be sufficient, the model selection could be questionable, project expectations could be unrealistic, and prediction inaccuracy could be harmful or even risk lives.
In short, it’s a useful and even crucial habit to take breaks and seek answers to conscious Machine Learning & AI related questions while iterating back and forth through the project steps.
Visualizing the Neighbors
5- kNN Visualization
Visualizing machine learning models can have many obvious and hidden benefits. By visualizing kNN we can see its prediction accuracy, we can communicate its predictions and we can understand how its hyperparameters are affecting the results.
a) Decision Border Plot using Contourf
Matplotlib’s contourf is a great visualization tool that allows visualizing different regions based on height (or any target value).
Here we will create a meshgrid and predict each value in meshgrid using kNN. Then using the same meshgrid coordinates as X1 and X2 and prediction results as Z we will create a decision border plot of our K-Nearest Neighbor classifier.
Here is Python code that can be used to create the meshgrid and the Z values (predictions for the points in the meshgrid):
1. Creating meshgrid and Z values
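import numpy as np
# Assumption: only the first two features are used so the regions can be drawn in 2D
X = data.data[:, :2]
y = data.target
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X, y)
# Grid bounds, padded by 1 unit around the data
X1min, X1max = X[:, 0].min() - 1, X[:, 0].max() + 1
X2min, X2max = X[:, 1].min() - 1, X[:, 1].max() + 1
step = 0.01
X1, X2 = np.meshgrid(np.arange(X1min, X1max, step), np.arange(X2min, X2max, step))
# Predict a class for every grid point, then reshape back to the grid
Z = clf.predict(np.c_[X1.ravel(), X2.ravel()])
Z = Z.reshape(X1.shape)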
2. Creating contourf plot for regions and scatterplot for sample points
Great. Now we have X1 and X2 values as well as Z. We can create a decision region plot using these three variables in Python.
We will also overlay the initial data points on the chart area using scatterplot from seaborn.
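import matplotlib.pyplot as plt
import seaborn as sns
cmap_dots = ['gray', 'cyan', 'darkred']
plt.figure(figsize=(8, 6))
# Filled regions show the predicted class at each grid point
plt.contourf(X1, X2, Z, cmap="plasma")
# Overlay the original samples, colored by species name
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=data.target_names[y], palette=cmap_dots, alpha=0.9, edgecolor="black")
plt.show()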
It’s lots of fun plotting decision regions and decision borders of machine learning algorithms using contourf. It offers a somewhat low-level charting opportunity and can be very satisfying when you apply it to your machine learning project.
However, sometimes analysts and scientists get busy. If you are looking for a much quicker implementation of decision border visualization, a higher-level charting technique can be used instead.
b) Confusion Matrix
The confusion matrix is a very useful evaluation tool for classification models. We have demonstrated its use in another example, which can be easily adapted to kNN models.
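As a rough sketch of how this could look for kNN on iris, the snippet below uses confusion_matrix from sklearn.metrics. The fixed random_state and the explicit labels argument are assumptions added for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
# random_state=42 is an illustrative assumption for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)
KNN = KNeighborsClassifier().fit(X_train, y_train)
yhat = KNN.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, yhat, labels=[0, 1, 2])
print(data.target_names)
print(cm)
```

Values on the diagonal are correct predictions; any off-diagonal value shows which class was mistaken for which.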
Other Evaluation Metrics
6- kNN Evaluation
We can evaluate the accuracy of kNN using the metrics module from the sklearn library. There are a number of metrics we can choose from. accuracy_score is a popular metric that can be used to evaluate classification models.
from sklearn import metrics
print("kNN Accuracy is: ", metrics.accuracy_score(y_test, yhat))
kNN Accuracy is: 0.96
If the model accuracy does not satisfy the project criteria, tuning can be an option. The tuning options of K-Nearest Neighbors are covered in more detail in a separate tutorial.
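As one possible starting point for tuning, here is a sketch that uses scikit-learn’s GridSearchCV to try a few values of n_neighbors and weights with 5-fold cross-validation. The candidate values in param_grid and the fixed random_state are arbitrary assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

data = load_iris()
# random_state=42 is an illustrative assumption for repeatability
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Hypothetical search space chosen only for demonstration
param_grid = {
    "n_neighbors": [3, 5, 7, 9, 11],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
# The refitted best model is then scored on the held-out test partition
print("Test accuracy:", search.score(X_test, y_test))
```

Only the training partition is used during the search, so the test partition stays untouched until the final score.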
In this machine learning example with the kNN algorithm, we have learned how to create a kNN classifier, how to train the kNN model and how to make predictions with it.
Additionally, we have learned how to visualize the kNN machine learning algorithm and how to evaluate its results. We have also discussed some AI governance topics and Machine Learning optimization practices regarding kNN models.
You can find more Machine Learning Tutorials about other algorithms such as SVM, Random Forest, Decision Trees, Logistic Regression and others on the main page.