
Orhan Yavuz

Data Scientist & Software Engineer



K-Nearest Neighbours

K-Nearest Neighbours (KNN) is a non-parametric, supervised machine learning algorithm, typically used for classification, although it can also be used for regression. It classifies each test point by measuring its distance to every training point; this post uses the Euclidean distance.
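To make the distance concrete, here is a minimal sketch of the Euclidean distance between two feature vectors. The helper name `euclidean_distance` is only illustrative; the two vectors are the first two rows of the iris feature table.

```python
from math import sqrt

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# First two rows of the iris feature table
row_0 = [5.1, 3.5, 1.4, 0.2]
row_1 = [4.9, 3.0, 1.4, 0.2]

d = euclidean_distance(row_0, row_1)
print(round(d, 4))  # sqrt(0.2**2 + 0.5**2) = sqrt(0.29), roughly 0.5385
```

The smaller this number, the more similar the two observations are considered to be.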

The Algorithm

For each test point:
1. Find the k nearest training points.
2. Determine the majority class among those k points.
3. Assign that majority class to the test point.
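These steps can be sketched on a toy problem using only the standard library. This is an illustration under assumed data: the points, the labels "a" and "b", and the function name `knn_predict` are all made up for the example.

```python
from collections import Counter
from math import dist

def knn_predict(k, test_point, train_points, train_labels):
    """Classify one test point by majority vote among its k nearest training points."""
    # Step 1: sort the training points by distance to the test point, keep the k closest
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(test_point, pair[0]),
    )
    k_labels = [label for _, label in by_distance[:k]]
    # Steps 2 and 3: the most common label among the neighbours is the prediction
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical toy data: three points near (1, 1) and two points near (5, 5)
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
train_labels = ["a", "a", "a", "b", "b"]

prediction_low = knn_predict(3, (1.1, 1.0), train_points, train_labels)
prediction_high = knn_predict(3, (5.1, 5.0), train_points, train_labels)
print(prediction_low, prediction_high)  # a b
```

The second test point has only two "b" neighbours nearby, so its third neighbour is an "a" point, but the majority vote still returns "b".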

import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt
from math import sqrt
from sklearn.model_selection import train_test_split
np.random.seed(10)

iris = datasets.load_iris()
data,target=pd.DataFrame(iris.data),pd.Series(iris.target)

The Data

The iris data has 4 features (columns 0 to 3 below) and 3 possible classes in its target.
We will build a k-nearest-neighbours classifier and check how well it performs on the iris dataset.

data.head()
     0    1    2    3
0  5.1  3.5  1.4  0.2
1  4.9  3.0  1.4  0.2
2  4.7  3.2  1.3  0.2
3  4.6  3.1  1.5  0.2
4  5.0  3.6  1.4  0.2

The Response

The response is a series of integers from 0 to 2. Each value is the class of one observation, that is, one row of the "data" object.

target
0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Length: 150, dtype: int32

Splitting the Data

We now split the data into training and testing sets, holding out 10% of the rows for testing.

X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.1, random_state=42)
X_test.head()
       0    1    2    3
73   6.1  2.8  4.7  1.2
18   5.7  3.8  1.7  0.3
118  7.7  2.6  6.9  2.3
78   6.0  2.9  4.5  1.5
76   6.8  2.8  4.8  1.4

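Conceptually, a random split shuffles the row indices and holds out a fraction of them. The sketch below shows that idea with the standard library only. It is not scikit-learn's implementation and will not reproduce the rows selected above; the function name `simple_train_test_split` and its parameters are assumptions made for illustration.

```python
import random

def simple_train_test_split(n_rows, test_size, seed):
    """Shuffle row indices and hold out a fraction of them for testing."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(n_rows * test_size)
    # First n_test shuffled indices form the test set, the rest the training set
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_train_test_split(n_rows=150, test_size=0.1, seed=42)
print(len(train_idx), len(test_idx))  # 135 15
```

With 150 iris rows and a test fraction of 0.1, 15 rows are held out for testing and 135 remain for training.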
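def knn(k, X_test, X_train, y_train):
    """Predict a class for each row of X_test by majority vote among its k nearest training rows."""
    predictions = []
    for _, test in X_test.iterrows():
        # Euclidean distance from this test point to every training point
        distance_points = X_train.apply(lambda x: sqrt(np.sum((x - test) ** 2)), axis=1)
        # Row labels of the k closest training points
        k_nearest_index = distance_points.sort_values()[:k].index
        # Classes of those neighbours; the most frequent one is the prediction
        k_nearest = y_train.loc[k_nearest_index]
        prediction = k_nearest.value_counts().index[0]
        predictions.append(prediction)
    return predictions

# y_train is passed in explicitly rather than read from the global scope
predictions = knn(7, X_test, X_train, y_train)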
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)
array([[6, 0, 0],
       [0, 5, 1],
       [0, 0, 3]], dtype=int64)
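In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so the diagonal counts the correct predictions. As a rough check, the accuracy and per-class recall can be read off the matrix above; the variable names here are only illustrative.

```python
# Confusion matrix from the output above (rows: true class, columns: predicted class)
matrix = [
    [6, 0, 0],
    [0, 5, 1],
    [0, 0, 3],
]

correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal entries
total = sum(sum(row) for row in matrix)                  # all test observations
accuracy = correct / total

# Recall per class: correct predictions divided by the true count of that class
recall = [matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]

print(f"accuracy = {correct}/{total} = {accuracy:.3f}")  # accuracy = 14/15 = 0.933
print([round(r, 3) for r in recall])                     # [1.0, 0.833, 1.0]
```

The only error is one observation of class 1 that was predicted as class 2.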