K-nearest neighbours in 3 functions
Feb 11, 2018
4 minutes read

K-nearest neighbours is a nice little algorithm for classification problems. It assigns a class to a data point based on the classes of the K (an integer) closest data points. As an example, let’s take a one-dimensional problem. Say we have a group of people at a party, and we know their heights and genders. Someone new is arriving, and we only know their height (…ok, this is an odd party). We can assign, or classify, the person as male or female based on the K (say, three) most similar heights among the rest of the guests.
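As a rough sketch of that party scenario (the heights and labels below are entirely made up for illustration), the whole idea fits in a few lines:

# Toy 1D example: classify a new guest by the genders of their 3 closest heights.
# The guest list is invented purely for illustration.
guests = [(155, "female"), (160, "female"), (168, "female"),
          (175, "male"), (180, "male"), (185, "male")]

new_height = 170
k = 3

# sort the known guests by how close their height is to the newcomer's
closest = sorted(guests, key=lambda g: abs(g[0] - new_height))[:k]

# majority vote among the k nearest heights
votes = [gender for _, gender in closest]
print(max(set(votes), key=votes.count))  # the most common gender among the 3 neighbours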

K-nearest neighbours is what is called a lazy algorithm, since the classification is done on the fly for each case, using the whole data set. Thus for every new person arriving at the party, we have to calculate all of the height proximities again. This can make the algorithm quite expensive for large, multidimensional datasets. Anyway, let’s look at implementing KNN in Python. We will need three functions:

  1. Measure the distances between the instance to be classified and all other data points
  2. Find the K nearest neighbours (closest data points to the instance to be classified)
  3. Vote on the classification based on the neighbours

Let’s go through these one by one, using the classic Iris dataset. The Iris data set has four variables (sepal width, sepal length, petal length, petal width) and assigns each flower to one of three types of Iris. We will use 80% of this data set for training, and test it using the other 20%. Since this is a lazy algorithm, for each value in the test set we need to compute distances against the other 80% (the known values) each time.
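For reference, the raw iris.data file from the UCI repository is a header-less CSV where each line holds the four measurements followed by the species name, looking something like this:

5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica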

1. Calculate the distance

We need to compute the similarity between two points. As a measure of similarity, we will use the Euclidean distance. With our height and gender example, we were working in a 1D vector space. In the Iris set, we are working in 4-dimensional space (sepal width, sepal length, petal length, petal width). To calculate the distance between two data points, we can use this general function:

import numpy as np

def euclidean_distance(point1, point2):
    # squared difference along each dimension
    dist2 = [(b - a) ** 2 for (a, b) in zip(point1, point2)]
    # the square root of the sum gives the Euclidean distance
    dist = np.sqrt(sum(dist2))
    return dist

This works for any dimensional space, whether we have 1, 2, 4, or 40 variables.
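For instance, a quick sanity check with two arbitrary 4-dimensional points (the values are made up, not taken from the dataset):

# two arbitrary 4-dimensional points, just to sanity-check the function
a = [5.1, 3.5, 1.4, 0.2]
b = [6.3, 3.3, 6.0, 2.5]
print(euclidean_distance(a, b))  # roughly 5.3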

2. Find the nearest neighbours

Once we have our distances, we choose the K smallest values to use as our classifiers. These are the neighbours. We use our euclidean_distance function within our neighbours function, calculating the distances between the point to classify and every point in the dataset. As a note, each row of our dataset has five entries (sepal width, sepal length, petal length, petal width, class), where the fifth is the class, so we drop this fifth entry from the distance calculation:

import operator

def get_neighbours(dataset, point, k):
    # distance from the query point to every known point (class column excluded)
    distances = []
    for i in range(len(dataset)):
        dist = euclidean_distance(point, dataset[i][:-1])
        distances.append((dataset[i], dist))
    # sort by distance and keep the k closest rows
    distances.sort(key=operator.itemgetter(1))
    neighbours = []
    for x in range(k):
        neighbours.append(distances[x][0])
    return neighbours

This is the main body of the algorithm. We pass in the whole data set and loop over it, collecting the neighbours for the point in question. Before moving on to the vote, here is a quick look at how the function is called.
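The tiny dataset below is hand-made (not taken from the Iris file), just to show the shape of the inputs and outputs; each row ends with its class label:

# three made-up rows: four measurements followed by a class label
tiny_dataset = [
    [5.1, 3.5, 1.4, 0.2, "setosa"],
    [7.0, 3.2, 4.7, 1.4, "versicolor"],
    [6.3, 3.3, 6.0, 2.5, "virginica"],
]

# the point to classify carries only the four measurements
query = [5.0, 3.4, 1.5, 0.2]

# returns the two rows closest to the query, class labels still attached
print(get_neighbours(tiny_dataset, query, 2))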

3. Vote on the class

We need to take our neighbours and have them vote on which class we assign to our data point. We do this simply by finding the majority:

def get_response(neighbours):
    # tally the class label (the last entry) of each neighbour
    classVotes = {}
    for x in range(len(neighbours)):
        response = neighbours[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    # return the class with the most votes
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

By passing our neighbours to get_response, the majority class is returned to us. Before putting everything together, here is a quick check of the voting.
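The three neighbours below are hypothetical rows, made up purely to show the vote; only the final class entry matters:

# three hypothetical neighbours; only the last (class) entry is counted
neighbours = [
    [5.1, 3.5, 1.4, 0.2, "setosa"],
    [4.9, 3.0, 1.4, 0.2, "setosa"],
    [7.0, 3.2, 4.7, 1.4, "versicolor"],
]
print(get_response(neighbours))  # "setosa" wins the vote 2-1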

Main algorithm

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# load the raw UCI file and name the columns
df = pd.read_csv("iris.data", names=["sepal length", "sepal width", "petal length", "petal width", "class"], index_col=False)

# encode the three species as integers so we can do arithmetic on them later
df.loc[df['class'] == "Iris-virginica", "class"] = 0
df.loc[df['class'] == "Iris-setosa", "class"] = 1
df.loc[df['class'] == "Iris-versicolor", "class"] = 2

# keep the class column in X so each row carries its label into the vote
X_train, X_test, y_train, y_test = train_test_split(np.array(df), np.array(df.loc[:, 'class']), test_size=0.2, random_state=1)

predictions = []
k = 3
for x in range(X_test.shape[0]):
    # use all four measurements, dropping the class label in the last column
    point = X_test[x][:-1]
    neighbours = get_neighbours(X_train, point, k)
    result = get_response(neighbours)
    predictions.append(result)
And that’s it! We have our set of predictions. We can test the accuracy of these based on a fourth function (sorry, I lied when I said 3 functions):

def accuracy(actual, prediction):
    # non-zero entries mark the misclassified points
    incorrect = (np.array(prediction) - np.array(actual)) ** 2
    total = len(np.nonzero(incorrect)[0])
    return ((len(actual) - total) / len(actual)) * 100

print(accuracy(y_test, predictions))
>>>96.6667

There we go: 3-nearest neighbours is able to predict with roughly 97% accuracy. Not bad considering how simple this algorithm is. Thanks for reading!
