kNN Tuning

Decision Trees can be improved by tuning their hyperparameters.

Amount of neighbors aka "k" in kNN

1- n_neighbors

This is probably the most major parameter in a Nearest Neighbor implementation. In Scikit-Learn you can define “k” value of the kNN algorithm by using the n_neighbors parameter.

Both in classification and regression implementations of the kNN algorithm, n_neighbors value will be used to calculate a sample’s distance values which are used to make kneighbor queries.

In simpler terms, for each sample n_neighbors (k) amount of neighbors will be identified and decision will be made based on those neighbors.

Since it’s a very fundamental parameter in kNN algorithm, n_neighbors can directly affect the accuracy of the results as well as runtime performance.

n_neighbors is 5 by default and you can adjust it as below while creating a kNN model:

KNN = KNearestNeighbors(n_neighbors = 9)

RandomForest Specific

Random Forest Model has its own parameters that can be tuned:

n_estimators
n_jobs
oob_score

Shared with Decision Trees

Random Forest Model also has trees in it to which tree parameters apply.

criterion
splitter
max_features

Processing and vectorizing sample points

2- algorithm

This parameter defines the algorithm that’s used to calculate neighbor distances. It is “auto” by default in Scikit-Learn which works pretty well to identify the ideal algorithm that should be used to calculate distances between sample points.

You can also choose to assign three other algorithms manually to algorithm parameter. These are:

brute: Most accurate yet least performance option. Usually an overkill and won’t be chosen especially when data is big.

ball_tree: An algorithm that’s used often for kNN implementations with ideal performance / accuracy ratio.

kd_tree: An algorithm that performs well when working with big data. Also might beat ball_tree performance depending on the dataset characteristics as well as dimensionality (feature size).

It can be difficult to choose between ball_tree and kd_tree algorithms manually just by looking at the data size and shape but thankfully, auto option usually does a great job making the ideal choice based on performance and accuracy considerations.

If you’d like to read more about core algorithms used in kNN you can visit this article:

kNN Core Algorithms

RF = RandomForestClassifier(n_jobs=-1)

A parameter to use with kd_tree and ball_tree algorithms

3- leaf_size

If you decide to use kd_tree or ball_tree algorithms. These are actually algorithms that create tree structures to vectorize points. These tree search algorithms improve kNN’s performance to O(NlogN) compared to brute algorithm which selects each sample one by one hence has O(N^2) time complexity which scales much less efficiently.

Since kd_tree and ball_tree algorithms will be used often, parameter leaf_size which affects them becomes significant. In Scikit-Learn implementation of KNearestNeighbors model leaf_size is 30 by default.

Tree structures are used to process querying data in batches rather than one by one (as in brute case). This means that if you increase the leaf size, data will be processed in larger batches which can improve the kNN’s runtime performance drastically. However, the risks of wrong neighbors ending up in the same batch also increases which may reduce the accuracy in some cases.

It’s very difficult to make precise estimations on leaf_size without experimenting with the dataset in question since it depends on the order, structure and balance of the dataset.

leaf_size offers a great opportunity to fine tune kNN algorithm when performance is a critical criteria.

If you are curious in how leaf nodes work and more details about their intricacies you can also check out Decision Tree algorithm which is completely based on tree structures.

Decision Tree Algorithm

RF = RandomForestClassifier(max_samples=0.5)

Distance calculations between sample points

4- Distance metrics in kNN

We have many distance calculation methods in scikit-learn intended to be used with kNN. For distance between vectors on the same dimensional space (such as two points on paper) following options are available and for 2 dimensional vectors (such as two points on Earth surface, since Earth surface has a curve) there is also Haversine distance calculation embedded in.

a) metric and p

metric parameter is used to define the method for distance calculations between sample points in kNN algorithm. By default metric is minkowski with parameter 2 (p=2).

KNN = KNearestNeighbors(metric="minkowski", p=2)

b) Euclidean vs Minkowski vs Manhattan

One dimensional vector space distance calculations are as following:

Euclidean (p=2)
Minkowski
Chebyshev
Manhattan (p=1)
Wminkowski
Seuclidean
Mahalanobis

Two dimensional vector space distance calculations:

Haversine

In scikit-learn these distance calculations can be assigned using “metric” parameter when constructing a kNN model. By default metric=”minkowski” and p=2. Please note that p=2 in minkowski distance calculation is equal to euclidean distance calculation and p=1 is equal to Manhattan distance calculation. Furthermore you can use other p values for arbitrary minkowski distance metrics.

Here is a practical example from scikit-learn:

knn=KNeighborsRegressor(n_neighbors=10, metric="minkowski", p=2)

kNN Regressor Model is created and assigned to variable knn using following parameters:

k=10
distance metric=minkowski
p=2

Please note in most cases distance calculation metric won’t make any difference in results. Euclidean is somewhat the norm metric because it caters to our spatial sensory and how we interpret space based on rotation (angle) in the Universe. Manhattan is a special case since highrises won’t let you have a Euclidean orientation hence its own metric “Manhattan”. (see Image)

kNN Parallelization

5- n_jobs

Parallelization is one of the handiest feature that can be utilized in kNN implementations. Parallelization is the concept of taking advantage of multi-core processors by making multiple cores work in parallel.

In Scikit-Learn parallelization can be handled using n_jobs parameter. By default n_jobs is assigned to 1 which only uses 1 processor when kNN is busy. To use a specific number of processor cores you can assign an integer value or to use all the cores in parallel you can assign n_jobs to -1 which will be very efficient to address sluggish runtime performance of kNN algorithm especially when working with big data or large dimensions.

To see a comprehensive study about kNN’s complexity and runtime performance you can refer to the article below:

kNN Complexity

KNN = KNearestNeighbors(n_jobs=-1)

Summary

Random Forest is a very versatile and tweakable machine learning algorithm. Its hyperparameters are straightforward and it offers even more flexibility than Decision Trees. A Random Forest algorithm has all the hyperparameters a Decision Tree has and more.

In this Random Forest Tutorial, we’ve seen the most commonly used Random Forest parameters and multiple ways to tweak for better Random Forest Tuning & Optimization.

When you understand the parameters and hyperparameters of the machine learning model you are using you will have more power over the model and overall machine learning process. This will allow you to tweak the results to fit your exact needs and outclass the competition.