K means Clustering is an Unsupervised Learning Algorithm. This means that, in practice, we do not know the desired response for each observation. Unsupervised methods tend to focus more on finding relationships in the data, rather than optimizing for a specific set of outcomes.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from random import randint
from math import sqrt
df=pd.read_csv("data.csv")
df.head()
V1 | V2 | |
---|---|---|
0 | 2.373546 | 3.183643 |
1 | 2.164371 | 4.595281 |
2 | 3.329508 | 2.179532 |
3 | 3.487429 | 3.738325 |
4 | 3.575781 | 2.694612 |
plt.scatter(df["V1"], df["V2"])
plt.show()
assignments = np.random.choice([1,2,3], size=len(df), replace=True, p=None)
centroids = [np.mean(df[level==assignments]) for level in [1,2,3]]
def plot_graph(assignments,centroids):
centroids=pd.DataFrame(centroids,columns=["V1","V2"])
plt.scatter(df["V1"], df["V2"],color=[(['r','g','b'])[i] for i in (assignments-1)])
plt.scatter(centroids["V1"], centroids["V2"],color=['r','g','b'],s=500)
plt.show()
plot_graph(assignments,centroids)
We can see that each point has been randomly assigned to a cluster.
Before we can start the algorithm, we need a way to calculate the Euclidean distances.
#This function finds the distance from a single point to all 3 centroids.
def distances(point,centroids):
return centroids.T.apply(lambda centroid: sqrt(np.sum([(centroid.V1-point.V1)**2,(centroid.V2-point.V2)**2])) )
converged = False
#assignments = np.random.choice([1,2,3], size=len(df), replace=True, p=None)
#centroids = [np.mean(df[level==assignments]) for level in [1,2,3]]
k=3
while(converged!=True):
plot_graph(assignments,centroids)
centroids=[]
cluster_levels = [i+1 for i in range(0,k)]
#Calculate centroid for each cluster level.
for level in cluster_levels:
index = level==assignments
level_points = df[index]
level_centroid = np.mean(level_points)
centroids.append(level_centroid)
centroids=pd.DataFrame(centroids)
#Loop over points
#points = [point for point in df.iloc]
new_assignments = np.array([distances(point,centroids).argmin(axis=0)+1 for point in df.iloc])
if((new_assignments==assignments).all()):
converged=True
assignments=new_assignments
centroids
V1 | V2 | |
---|---|---|
0 | 3.104574 | 3.010806 |
1 | 4.973649 | 8.147587 |
2 | 7.917699 | 3.032797 |