Theory

Theory#

Optimization Problem#

The goal is to maximize within-cluster similarity, or equivalently, minimizing the within-cluster variation:

\[ min_{C_1,...,C_K} \sum_k W(C_k)\]

where W(.) is a measure, most commonly Euclidean distance:

\[ min_{C_1,...,C_K} \sum_k \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_1^p (x_{ij}-x_{i'j})^2 \]

where \(|C_k|\) denotes the number of observations in the \(k\)th cluster and \(p\) is the number of features.

Algorithm#

K-means just requires \(k\), the number of clusters, as an input. There are few variations of the algorithm and different distance measures can be used. The following is one example:

The algorithm uses the following steps:

Randomly initialize all points to one of \(k\) clusters.
Compute the centroids of each cluster.
Assign all points to the cluster whose centroid is closest
Repeat steps 2 and 3 until convergence, i.e. fewer than X% of points are re-assigned clusters at step 3.

Note that this algorithm finds a local, rather than global, minimum. So may be sensitive to the random initialization. So make sure to run the algorithm multiple times.

Run Manually for Demonstration#

class KmeansDemo:

    def __init__(self, X, K):
        self.X = X
        self.K = K
        self.n = len(self.X)
        self.distances = []
        self.clusters = {0:np.random.choice(self.K, self.n)}
        self.centroids = {}

    def get_centroids(self, clusters):
        return np.array([
            self.X[clusters==k].mean(axis=0)
            for k in range(self.K)
        ])

    def get_clusters(self, centroids):
        distances = np.linalg.norm(
            self.X[:,None,:] - centroids[None,:,:], 
            axis=-1
        )
        return np.argmin(distances, axis=1)

    def __call__(self):
        self.centroids[0] = self.get_centroids(self.clusters[0])

        # save each iter for plotting
        for iter in range(1, 15):
            self.clusters[iter] = self.get_clusters(self.centroids[iter-1])
            self.centroids[iter] = self.get_centroids(self.clusters[iter])
            update = np.mean(mod.clusters[iter-1]!=mod.clusters[iter])
            print(iter, update)
            if update < 0.002:
                break

Prints iteration and percentage of points updated. Stops when no new points are updated.

mod = KmeansDemo(X=X, K=3)
mod()

Graph Each Iteration#

color_dict =  {0:"red", 1:"green", 2:"purple"}

sns.set(rc={'figure.figsize':(8,5)})

def plot_iter(i):
    
    fig = plt.figure(i)
    ax = sns.scatterplot(X[:,0], X[:,1], hue=mod.clusters[i])
    sns.scatterplot(mod.centroids[i][:,0], mod.centroids[i][:,1], color="red", s=100)
    ax.set_title(f"Iter {i}")

for _ in range(len(mod.centroids)):
    plot_iter(_)

../../../../_images/f7644eb5f0791a96c4c719cc77f3e4c58f81d99fc855f59f8cffaf29a9bea5bf.png

../../../../_images/c3c7d13f13023d02f5eba01af1b364ebc5c74ae1d791a23424a8e803a07eddbe.png

../../../../_images/164f2bf2ac39d83a064112cb8448b784e82cf0ccffb00b3e7882a705e07f40cb.png

../../../../_images/d9c6821e265bb77d6268d92ab93835ba99b7817a046b626ebb47a0afdd3b80aa.png

../../../../_images/0173ea63f81fa894f2fce690c18864c02cb71c6029adca6ade2abfeeec033edc.png

../../../../_images/93a3172d2e0fd9b8a2179e370910ffba17cd93e5442c15bcb6f1b38f14d8bb35.png

../../../../_images/0616d717a9d37f356793e83ac225c4e03d12f84a4093e33afbacefcec127598e.png

../../../../_images/b643dc75698bd3171bd6c75faf8c564a142876264c1547df61dc582f4f8913bb.png

../../../../_images/493b9692a3e6a60125f7c7c32e9da84ca0f23bfb6851eacccfc2b22adb5edc47.png