Core Algos

Before it can make predictions, the kNN algorithm needs a way to store the training data in memory, and there are several ways to store that data and index it in space.

In kNN, this step is usually handled by one of three algorithms: brute, kd_tree and ball_tree.

brute: Simply keeps each data point exactly where it is and, at prediction time, calculates the distance between every pair of points. The search is exact, but it’s a very costly way from the perspective of computation. The brute algorithm comes with O(DN^2) complexity, where D is the number of features and N is the number of samples.

kd_tree: A tree search algorithm that can replace brute for larger datasets. kd_tree keeps the accuracy high (in scikit-learn the tree search is still exact) while improving computational efficiency, because most distance calculations are skipped. It has a time complexity of about O(DN log N).

ball_tree: Another tree algorithm that outperforms kd_tree in some specific situations, such as big data with very high dimensionality.
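To make the brute description concrete, here is a minimal sketch of a brute-force neighbor search in NumPy, compared against scikit-learn's kd_tree search. The helper name brute_force_knn and the random data are assumptions made up for illustration; they are not part of scikit-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # N = 200 points, D = 3 features (made-up data)


def brute_force_knn(X, k):
    """Hypothetical helper: exact k nearest neighbors by checking every pair."""
    # Differences between every pair of points: an N x N x D array,
    # which is where the O(D * N^2) cost comes from.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sort each row by distance and keep the k closest.
    # Column 0 is the point itself, at distance 0.
    idx = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx


brute_dist, brute_idx = brute_force_knn(X, k=5)

# Same search through scikit-learn's kd_tree for comparison.
tree = NearestNeighbors(n_neighbors=5, algorithm="kd_tree").fit(X)
tree_dist, tree_idx = tree.kneighbors(X)

print("distances agree:", np.allclose(brute_dist, tree_dist))
```

On data like this, both searches should return the same neighbor distances; the difference is only in how much work is done to find them.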

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space –Wikipedia

The name kd_tree stands for k-dimensional tree: the algorithm splits the data along one feature axis at a time, allocating points to tree nodes and eventually to leaves.

ball_tree, on the other hand, builds its tree structure by clustering data around the farthest data points and assigning new clusters based on those, which results in a hierarchy of nested hyperspheres ("balls").
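Both structures can also be built and queried directly through the KDTree and BallTree classes in sklearn.neighbors. The following is a small sketch on made-up random data; the leaf_size value and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(42)
X = rng.random((500, 4))  # 500 made-up points with 4 features

# Build each index once; leaf_size controls how many points end up in a leaf.
kd = KDTree(X, leaf_size=30)
ball = BallTree(X, leaf_size=30)

# Query the 4 nearest neighbors of the first three points.
query = X[:3]
kd_dist, kd_idx = kd.query(query, k=4)
ball_dist, ball_idx = ball.query(query, k=4)

print("kd_tree neighbors:  ", kd_idx.tolist())
print("ball_tree neighbors:", ball_idx.tolist())
```

Since the query points come from the indexed data, the first neighbor of each one is the point itself at distance 0.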

Usage in kNN with Scikit-Learn

Implementation

You can simply assign these algorithms when building a kNN model in Scikit-learn. All you have to do is pass one of them to the algorithm parameter.

brute
kd_tree
ball_tree

auto: This option automatically picks one of the three algorithms based on its suitability for the training data passed to fit. It usually does a good job of choosing the right algorithm in terms of expected performance.

# KNeighborsClassifier lives in sklearn.neighbors, not sklearn.tree
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(algorithm="kd_tree")
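For a fuller picture, here is a minimal end-to-end sketch that fits and scores a kd_tree-backed classifier. The iris dataset, the 25% test split and n_neighbors=5 are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small built-in dataset, used here only for illustration.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The algorithm parameter only changes how neighbors are searched for,
# not how the final vote among the neighbors is made.
knn = KNeighborsClassifier(n_neighbors=5, algorithm="kd_tree")
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Swapping "kd_tree" for "ball_tree", "brute" or "auto" requires no other change to the code.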

How Do kNN Core Algorithms Compare?

Ball Tree vs Kd_tree vs Brute Force

kNN uses different algorithms to store and index data properly. The options for the algorithm parameter are brute, ball_tree, kd_tree and auto.

‘ball_tree’: A tree algorithm that works well with larger datasets. It performs especially well with high-dimensional data.
‘kd_tree’: An alternative tree algorithm that works well with large datasets. It can sometimes outperform ball_tree significantly when the data doesn’t have very many dimensions.
‘brute’: Usually the least performant option on large datasets. It uses brute force to calculate the distance to each point one by one, with no index to build. Its results are exact, but in scikit-learn the tree searches are exact too, so the extra computational cost won’t always be justified. Mostly suitable for small datasets.
‘auto’: Scikit-learn’s attempt to choose the most suitable algorithm for kNN based on the training data. It works very well in most cases and usually finds an appropriate algorithm among the three options. Auto is also the default value of the algorithm parameter in Scikit-Learn’s kNN classes.
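If you are curious which algorithm auto ended up with, one way to peek is sketched below. Note the loud assumption: _fit_method is a private scikit-learn attribute, not a documented API, so it may be missing or renamed in other versions, which is why it is read defensively with getattr. The data is random and made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))   # made-up training data
y = (X[:, 0] > 0).astype(int)   # made-up binary labels

knn = KNeighborsClassifier(n_neighbors=5)  # algorithm is left at its default
default_algorithm = knn.get_params()["algorithm"]
knn.fit(X, y)

# ASSUMPTION: _fit_method is a private attribute and may not exist in every
# scikit-learn version; getattr returns None instead of failing if it is gone.
chosen = getattr(knn, "_fit_method", None)

print("default algorithm:", default_algorithm)
print("resolved at fit time:", chosen)
```

The resolved value can differ between datasets and scikit-learn versions, so treat it as a debugging aid rather than something to rely on.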

So, if you’re not having serious performance issues, the auto option will most likely be suitable for you. If the data is large and you want to make sure the right implementation is in place, testing both ball_tree and kd_tree will be the smart thing to do.

Either of these tree algorithms can beat the other in performance when working with big data. ball_tree is expected to be superior when the number of features is very high, but data characteristics such as noise also matter, so testing is still the best approach to find out in custom cases.
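Such a test can be as simple as timing each option on your own data. The sketch below uses made-up random data as a stand-in; the sizes, the number of features and n_neighbors=5 are arbitrary assumptions, and the timings will vary with hardware and scikit-learn version.

```python
import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))           # stand-in for your training data
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # made-up binary labels
X_query = rng.normal(size=(500, 10))      # stand-in for new points to classify

timings = {}
predictions = {}
for algorithm in ("brute", "kd_tree", "ball_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    start = time.perf_counter()
    knn.fit(X, y)                          # builds the index for the tree options
    predictions[algorithm] = knn.predict(X_query)
    timings[algorithm] = time.perf_counter() - start

# Print the options from fastest to slowest for this particular dataset.
for algorithm, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{algorithm:>9}: {seconds:.4f} s")
```

For a decision that matters, repeat the measurement several times and use a sample of the real data, since the ranking can change with size, dimensionality and structure.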