logo

Orhan Yavuz

Data Scientist & Software Engineer



KMeansClustering

K means Clustering is an Unsupervised Learning Algorithm. This means that, in practice, we do not know the desired response for each observation. Unsupervised methods tend to focus more on finding relationships in the data, rather than optimizing for a specific set of outcomes.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from random import randint
from math import sqrt
df=pd.read_csv("data.csv")
df.head()
V1 V2
0 2.373546 3.183643
1 2.164371 4.595281
2 3.329508 2.179532
3 3.487429 3.738325
4 3.575781 2.694612
plt.scatter(df["V1"], df["V2"])
plt.show()

The Algorithm

  1. Decide how many k clusters we want the algorithm to partition.
  2. Assign each point to a random cluster.
  3. Calculate the centorid for each cluster.
  4. Assign each point to the cluster that has the closest centroid.
  5. Repeat steps 3, 4 until until convergence, which means that no new points are updated to a different cluster.
assignments = np.random.choice([1,2,3], size=len(df), replace=True, p=None)
centroids = [np.mean(df[level==assignments]) for level in [1,2,3]]

def plot_graph(assignments,centroids):
    centroids=pd.DataFrame(centroids,columns=["V1","V2"])
    plt.scatter(df["V1"], df["V2"],color=[(['r','g','b'])[i] for i in (assignments-1)])
    plt.scatter(centroids["V1"], centroids["V2"],color=['r','g','b'],s=500)
    plt.show()
plot_graph(assignments,centroids)

We can see that each point has been randomly assigned to a cluster.
Before we can start the algorithm, we need a way to calculate the Euclidean distances.

#This function finds the distance from a single point to all 3 centroids.
def distances(point,centroids):
    return centroids.T.apply(lambda centroid: sqrt(np.sum([(centroid.V1-point.V1)**2,(centroid.V2-point.V2)**2])) )

converged = False
#assignments = np.random.choice([1,2,3], size=len(df), replace=True, p=None)
#centroids = [np.mean(df[level==assignments]) for level in [1,2,3]]

k=3
while(converged!=True):
    plot_graph(assignments,centroids)
    centroids=[]
    cluster_levels = [i+1 for i in range(0,k)]
    #Calculate centroid for each cluster level.
    for level in cluster_levels:
        index = level==assignments
        level_points = df[index]
        level_centroid = np.mean(level_points)
        centroids.append(level_centroid)
    centroids=pd.DataFrame(centroids)
    #Loop over points
    #points = [point for point in df.iloc]
    new_assignments = np.array([distances(point,centroids).argmin(axis=0)+1 for point in df.iloc])
    if((new_assignments==assignments).all()):
        converged=True   
    assignments=new_assignments
centroids
V1 V2
0 3.104574 3.010806
1 4.973649 8.147587
2 7.917699 3.032797