K-Nearest Neighbours is a non-parametric, supervised machine learning algorithm. It is typically used for classification, although it can also be used for regression. The algorithm is based on computing the distance (here, the Euclidean distance) between each test point and every training point.
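As a quick illustration of the distance computation, the sketch below measures the Euclidean distance between two four-feature observations (the values match the first two iris rows printed further down). It uses only the standard library, and the variable names are chosen here purely for illustration.

```python
from math import dist, sqrt

# Two observations with four features each.
point_a = (5.1, 3.5, 1.4, 0.2)
point_b = (4.9, 3.0, 1.4, 0.2)

# Euclidean distance written out: square root of the summed squared differences.
manual = sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))

# The same quantity via the standard library helper (Python 3.8+).
builtin = dist(point_a, point_b)

print(round(manual, 4), round(builtin, 4))
```

Both values come out at roughly 0.5385, since only the first two features differ (by 0.2 and 0.5).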
For each test point, find the k nearest training points, determine the majority class among those k points, and assign that class to the test point.
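The three steps above can be sketched on a toy problem before moving to real data. This is a minimal standard-library illustration, not the implementation used later in this notebook; the function name `knn_predict` and the 2-D points are made up for the example.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k):
    """Return the majority label among the k training points closest to query."""
    # Step 1: distance from the query to every training point, paired with its label.
    distances = [(dist(p, query), label) for p, label in zip(train_points, train_labels)]
    # Step 2: keep the k smallest distances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the labels of those neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D data (hypothetical): two well-separated clusters.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # a
print(knn_predict(train_points, train_labels, (3.9, 4.1), k=3))  # b
```

An odd k is a common choice for two-class problems because it avoids tied votes; with more classes, ties are still possible and need a tie-breaking rule.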
import numpy as np
import pandas as pd
from sklearn import datasets
import matplotlib.pyplot as plt
from math import sqrt
from sklearn.model_selection import train_test_split
np.random.seed(10)
iris = datasets.load_iris()
data,target=pd.DataFrame(iris.data),pd.Series(iris.target)
The iris data has 4 features and 3 possible levels in its target.
We will build a k-nearest-neighbours classifier that performs well on the iris dataset.
data.head()
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
The response is a series of integers from 0 to 2. Each integer is the class label of one observation, that is, one row of the "data" object.
target
0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    2
147    2
148    2
149    2
Length: 150, dtype: int32
We now split the data into training and testing sets, holding out 10% of the observations for testing.
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.1, random_state=42)
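To make the effect of the split concrete, here is a minimal standard-library sketch of a shuffled hold-out split. It only illustrates the idea: the helper `holdout_split` is hypothetical, and it will not select the same rows as `train_test_split`, which uses its own random number generator.

```python
import random

def holdout_split(n_rows, test_fraction, seed):
    """Shuffle row indices and split them into (train_indices, test_indices)."""
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle, and therefore the split, reproducible.
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(n_rows=150, test_fraction=0.1, seed=42)
print(len(train_idx), len(test_idx))  # 135 15
```

With 150 iris observations and a 10% test fraction, 15 rows are held out and 135 remain for training. A test set this small makes any accuracy estimate quite coarse.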
X_test.head()
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 73 | 6.1 | 2.8 | 4.7 | 1.2 |
| 18 | 5.7 | 3.8 | 1.7 | 0.3 |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 |
| 78 | 6.0 | 2.9 | 4.5 | 1.5 |
| 76 | 6.8 | 2.8 | 4.8 | 1.4 |
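def knn(k, X_test, X_train, y_train):
    """Predict a class for each row of X_test by majority vote among its k nearest training rows."""
    predictions = []
    # Each `test` is a Series holding the feature values of one test observation.
    for _, test in X_test.iterrows():
        # Euclidean distance from this test point to every training point.
        distance_points = np.sqrt(((X_train - test) ** 2).sum(axis=1))
        # Index labels of the k closest training points.
        k_nearest_index = distance_points.sort_values()[:k].index
        # Classes of those neighbours, looked up by index label.
        k_nearest = y_train.loc[k_nearest_index]
        # value_counts() sorts by frequency, so the first entry is the majority class.
        prediction = k_nearest.value_counts().index[0]
        predictions.append(prediction)
    return predictions

# y_train is passed explicitly rather than read from the global scope.
predictions = knn(7, X_test, X_train, y_train)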
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)
array([[6, 0, 0],
       [0, 5, 1],
       [0, 0, 3]], dtype=int64)
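The confusion matrix can be turned into summary numbers. The sketch below copies the matrix printed above into a plain list and derives the overall accuracy and the per-class recall; it assumes scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
# Confusion matrix reported above: rows are true classes, columns are predicted classes.
cm = [[6, 0, 0],
      [0, 5, 1],
      [0, 0, 3]]

total = sum(sum(row) for row in cm)
# Correct predictions lie on the diagonal.
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total

# Recall per class: correctly predicted observations divided by the row total.
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print(total, correct, round(accuracy, 3))  # 15 14 0.933
print([round(r, 3) for r in recall])       # [1.0, 0.833, 1.0]
```

So 14 of the 15 test observations are classified correctly, and the single error is a class 1 observation predicted as class 2.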