kNN is a fantastic ML algorithm that’s easy to understand and implement. But it comes with a few drawbacks. We will explore those drawbacks in this article so you can use this great algorithm more wisely in the future.

## Contents

Introduction

1. What is kNN?
2. A very simple kNN example from daily life
3. Reasons behind kNN being resource-hungry

Summary

## 1. What is kNN?

kNN (k-Nearest Neighbors) is a simple machine learning algorithm that predicts a new data point from the k closest data points (neighbors) in the training data: the majority class among them for classification, or the average of their values for regression.
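To make the "average value" case concrete, here is a minimal sketch of kNN regression using only the Python standard library. The function name `knn_regress` and the tiny car dataset are made up for illustration, and Euclidean distance is an assumption (other distance measures work too).

```python
import math


def knn_regress(train, query, k):
    """Average the target values of the k stored points closest to `query`."""
    # Sort all stored points by their distance to the query and keep the first k.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    return sum(target for _, target in nearest) / k


# Hypothetical data: (engine size in liters,) -> fuel use in L/100 km
cars = [((1.0,), 5.0), ((1.4,), 6.0), ((2.0,), 8.0), ((3.0,), 11.0)]

# The two closest engine sizes to 1.6 are 1.4 and 2.0, so the result is (6 + 8) / 2.
print(knn_regress(cars, (1.6,), k=2))  # 7.0
```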

## 2. A very simple kNN example from daily life

Suppose you are trying to predict whether someone will like a new movie or not, based on their previous movie ratings. You have a dataset of movie ratings for several people, and you want to predict whether a new person will like a particular movie.

In this case, you could use kNN by calculating the distance between the new person and every other person in the dataset, based on their movie ratings. Distance here is a measure of how different two people's ratings are, for example the absolute difference for a single rating or the Euclidean distance across several ratings. You would then select the k closest people and predict whether the new person will like the movie based on the majority opinion of those k people.

For instance, if the k closest people all liked the movie, you would predict that the new person will also like the movie. Conversely, if the k closest people didn’t like the movie, you would predict that the new person will not like the movie.
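The movie example can be sketched in a few lines of plain Python. Everything in it is hypothetical: the ratings, the labels and the function name `knn_predict` are invented for illustration, and Euclidean distance over three earlier ratings is just one reasonable choice.

```python
import math
from collections import Counter

# Hypothetical data: each person is described by their ratings (1-5) of three
# movies they have already seen, plus whether they liked the new movie.
ratings = [
    ((5, 4, 1), "like"),
    ((4, 5, 2), "like"),
    ((5, 5, 1), "like"),
    ((1, 2, 5), "dislike"),
    ((2, 1, 4), "dislike"),
    ((1, 1, 5), "dislike"),
]


def knn_predict(train, query, k):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Distance from the query to every stored person (Euclidean).
    distances = [(math.dist(features, query), label) for features, label in train]
    # Keep the k closest people.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over their labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


new_person = (4, 4, 2)
print(knn_predict(ratings, new_person, k=3))  # like
```

Note that the function has to look at every person in `ratings` for each prediction, which is exactly where the cost discussed in the next section comes from.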

[Figure: Bokeh plot]

## 3. Reasons behind kNN being resource-hungry

kNN works by finding the k data points in a dataset that are closest to a given test point. The computational cost of this search depends on several factors, including:

• the size of the dataset,
• the dimensionality of the data,
• and the value of k.

### 3.1 – Distance Calculations

One reason why kNN can be computationally expensive is that a straightforward (brute-force) implementation calculates the distance between the test point and every point in the dataset. Each distance calculation in turn touches every feature, so the work for a single prediction grows with the number of data points multiplied by the number of dimensions.

The concept of a dimension can be confusing for beginners. Each feature of the data is one dimension, and each additional feature adds one more term to every distance calculation. For example, if the data is about cars, then color, engine size, make, year, weight, power, speed, acceleration and emissions could each be a feature and therefore a dimension. Some datasets have thousands or even millions of features.
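As a rough illustration of how this adds up, the sketch below runs a brute-force distance pass over random data and counts the per-feature steps one prediction needs. The helper names (`brute_force_distances`, `per_feature_operations`) and the dataset sizes are assumptions chosen for the example, not measurements from a real workload.

```python
import math
import random


def brute_force_distances(data, query):
    """Return the Euclidean distance from `query` to every row in `data`."""
    return [math.dist(row, query) for row in data]


def per_feature_operations(n_points, n_features):
    """Number of subtract-and-square steps needed for ONE prediction."""
    return n_points * n_features


random.seed(0)
n_points, n_features = 1_000, 50
data = [[random.random() for _ in range(n_features)] for _ in range(n_points)]
query = [random.random() for _ in range(n_features)]

distances = brute_force_distances(data, query)
print(len(distances))                                # 1000 (one distance per stored point)
print(per_feature_operations(n_points, n_features))  # 50000
print(per_feature_operations(1_000_000, 1_000))      # 1000000000
```

The last line shows why large, wide datasets hurt: a million points with a thousand features each means about a billion per-feature steps for every single prediction.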

### 3.2 – Data Storage in Memory

Another reason why kNN can be resource-hungry is that it requires storing the entire dataset in memory. This can become a problem if the dataset is very large, as the memory requirements can quickly become unmanageable.

Unlike other machine learning algorithms such as Naive Bayes or Decision Trees, kNN doesn’t really have a training process. Instead, during the so-called training phase kNN simply stores all of the data, and the real work is deferred until a prediction is requested.
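A minimal sketch of what this means in practice is shown below. The `LazyKNN` class and the `dataset_bytes` helper are hypothetical, and the size estimate assumes a dense dataset stored as 64-bit floats (8 bytes per value); real libraries may add overhead or use other layouts.

```python
def dataset_bytes(n_points, n_features, bytes_per_value=8):
    """Rough size of a dense numeric dataset (8 bytes = one 64-bit float)."""
    return n_points * n_features * bytes_per_value


class LazyKNN:
    """'Training' only keeps the data around; no compact model is built."""

    def fit(self, X, y):
        self.X, self.y = X, y  # the whole dataset stays with the model
        return self


model = LazyKNN().fit([[5, 4, 1], [1, 2, 5]], ["like", "dislike"])
print(len(model.X))  # 2 (every training row is still there)

# Estimated size of 10 million points with 100 features each, in GiB.
gib = dataset_bytes(10_000_000, 100) / 2**30
print(round(gib, 2))  # 7.45
```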

### 3.3 – Neighbor Size (k)

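Additionally, the value of k affects the cost. In a brute-force search the distances to all data points are calculated regardless of k, but a larger k means more neighbors have to be selected, kept track of and counted in the vote. With tree-based indexes, a larger k also tends to widen the search, because fewer parts of the tree can be ruled out early.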
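One way to see where k enters a brute-force search is to separate the distance step from the selection step. The sketch below is an illustration under that assumption; `k_nearest` is a made-up helper, and it uses `heapq.nsmallest` from the standard library to pick the k smallest distances without sorting the whole list.

```python
import heapq
import math


def k_nearest(data, query, k):
    """Return the k (distance, index) pairs closest to `query`."""
    # Step 1: one distance per stored point. This does not depend on k.
    distances = [(math.dist(row, query), i) for i, row in enumerate(data)]
    # Step 2: selection. Only this step (and the later vote) grows with k.
    return heapq.nsmallest(k, distances)


data = [(0, 0), (1, 1), (2, 2), (5, 5), (9, 9)]
query = (1.0, 1.5)

print([i for _, i in k_nearest(data, query, k=1)])  # [1]
print([i for _, i in k_nearest(data, query, k=3)])  # [1, 2, 0]
```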

To reduce the computational cost of kNN, various techniques have been developed, such as

• using approximation algorithms,
• reducing the dimensionality of the data, or
• employing efficient data structures like KD-trees or ball trees.
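To give a feel for the last technique, here is a small, unoptimized KD-tree sketch in plain Python that finds the single nearest neighbor and counts how many stored points it had to look at. The function names and the dictionary-based tree layout are choices made for this example; in practice you would more likely rely on a library implementation (scikit-learn's `KNeighborsClassifier`, for instance, accepts `algorithm="kd_tree"` or `algorithm="ball_tree"`).

```python
import math
import random


def build_kdtree(points, depth=0):
    """Recursively split the points on one coordinate per level (median split)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None, stats=None):
    """Return (distance, point) for the nearest neighbor of `query`."""
    if node is None:
        return best
    if stats is not None:
        stats["visited"] += 1  # count every stored point we compare against
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    # Signed distance from the query to this node's splitting plane.
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best, stats)
    # The far side can only hold a closer point if the plane is nearer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best, stats)
    return best


random.seed(42)
points = [(random.random(), random.random()) for _ in range(2_000)]
query = (0.5, 0.5)

tree = build_kdtree(points)
stats = {"visited": 0}
tree_dist, tree_point = nearest(tree, query, stats=stats)
brute_point = min(points, key=lambda p: math.dist(p, query))

print(tree_point == brute_point)       # True (same answer as brute force)
print(stats["visited"] < len(points))  # True (only part of the data was examined)
```

The saving comes from skipping whole branches of the tree. That works well in low dimensions, such as the two used here, but tends to fade as the number of dimensions grows.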

The charts above were created using this Bokeh visualization tutorial with Python. The chart on the left shows a dataset with 5 neighbors (k=5), and the chart on the right shows one with 50 neighbors (k=50).

Below, you can see partial results from a runtime test I ran back in 2020. You can read a complementary article I’ve written to see more practical results from that test and read about:

## Summary

In this article, we’ve explored the main reasons why kNN may not be suitable for large datasets. We have also discussed the actual time it might take to run the kNN algorithm on large datasets.