Hard & Soft Clustering with K-means, Weighted K-means and GMM-EM in Python

The following problems appeared as a project in the edX course ColumbiaX: CSMM.102x Machine Learning.

In this assignment the following clustering algorithms will be implemented:

  1. Hard clustering with K-means
  2. Soft clustering with
    a. Weighted K-means
    b. Gaussian mixture models with Expectation Maximization.

Some datasets with n data points {x_1, …, x_n} will be used for testing the algorithms, where each x_i ∈ R^d.

Hard Clustering

  1. Each point is assigned to one and only one cluster (hard assignment).
  2. With K-means we try to find K centroids {μ_1, …, μ_K} and the corresponding cluster assignments {c_1, …, c_n} of the data points, where each c_i ∈ {1, …, K} indicates which of the K clusters the observation x_i belongs to. The objective function that we seek to minimize can be written as

L = ∑_{i=1}^{n} ∑_{k=1}^{K} 1{c_i = k} · ||x_i − μ_k||²
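
A minimal NumPy sketch of the standard alternating updates (Lloyd's algorithm) for this objective is shown below; the function name `kmeans`, the random initialization, and the fixed iteration count are illustrative choices, not part of the assignment.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Hard K-means: alternately update assignments c and centroids mu."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # initialize centroids with K randomly chosen data points
    mu = X[rng.choice(n, size=K, replace=False)].copy()
    for _ in range(n_iter):
        # assignment step: c_i = argmin_k ||x_i - mu_k||^2
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        c = dists.argmin(axis=1)
        # update step: mu_k = mean of the points assigned to cluster k
        for k in range(K):
            if np.any(c == k):
                mu[k] = X[c == k].mean(axis=0)
    return mu, c
```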

Soft Clustering

(1) Each point is assigned to all the clusters with different weights or probabilities (soft assignment).

(2) With Weighted K-means we try to compute the weight ϕ_i(k) of each data point i for each cluster k by minimizing the following objective:

[equation: weighted K-means objective]
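
As a rough sketch (not the assignment's exact derivation), the updates that come out of this kind of objective can be implemented as below, assuming the usual soft assignment ϕ_i(k) ∝ exp(−β ||x_i − μ_k||²) with stiffness parameter β:

```python
import numpy as np

def weighted_kmeans(X, K, beta=10.0, n_iter=100, seed=0):
    """Soft (weighted) K-means: phi_i(k) proportional to exp(-beta * ||x_i - mu_k||^2)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=K, replace=False)].copy()
    for _ in range(n_iter):
        # weight step: each row of phi sums to 1 over the K clusters
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, K)
        logits = -beta * dists
        logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # centroid step: each mu_k is the phi-weighted mean of the data
        mu = (phi.T @ X) / phi.sum(axis=0)[:, None]
    return mu, phi
```

Larger β makes the weights more peaked (closer to hard K-means), while small β spreads each point's weight across clusters.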

(3) With GMM-EM we can do soft clustering too. The EM algorithm can be used to learn the parameters of a Gaussian mixture model. For this model, we assume a generative process for the data as follows:
c_i ∼ Discrete(π),   x_i | c_i ∼ Normal(μ_{c_i}, Σ_{c_i}),   i = 1, …, n

In other words, the i-th observation is first assigned to one of K clusters according to the probabilities in the vector π, and the value of the observation x_i is then generated from one of K multivariate Gaussian distributions, using the mean and covariance indexed by c_i. The EM algorithm seeks to maximize

∑_{i=1}^{n} ln p(x_i | π, μ, Σ) = ∑_{i=1}^{n} ln ∑_{k=1}^{K} π_k · Normal(x_i | μ_k, Σ_k)
over all parameters π, μ_1, …, μ_K, Σ_1, …, Σ_K, using the cluster assignments c_1, …, c_n as the hidden data.
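
A compact sketch of the corresponding EM updates is given below; the initialization (means drawn from the data, covariances set to the sample covariance) and the small ridge added to the covariances are assumptions made here for numerical stability, not part of the assignment.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture; returns parameters and the log-likelihood trace."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                      # mixing weights
    mu = X[rng.choice(n, size=K, replace=False)]  # means
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    log_liks = []
    for _ in range(n_iter):
        # E-step: posterior responsibilities phi_i(k) = p(c_i = k | x_i)
        dens = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=sigma[k])
            for k in range(K)
        ])                                        # (n, K)
        log_liks.append(np.log(dens.sum(axis=1)).sum())
        phi = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, sigma from the responsibilities
        nk = phi.sum(axis=0)
        pi = nk / n
        mu = (phi.T @ X) / nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            sigma[k] = (phi[:, k, None] * diff).T @ diff / nk[k] + 1e-6 * np.eye(d)
    return pi, mu, sigma, phi, log_liks
```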

The following figures show the algorithms that are going to be implemented for clustering.

[algorithm: K-means (hard clustering)]

[algorithm: Weighted K-means (soft clustering)]

[algorithm: EM for Gaussian mixture models]

The following animations show the output of the clustering algorithms (and how they converge over iterations) on a few datasets (with k=3 clusters); weighted K-means is run with the stiffness parameter beta=10.

Dataset1

[animations: K-means and Weighted K-means convergence]

[animation: GMM-EM convergence]

Dataset2

[animations: K-means and Weighted K-means convergence]

[animation: GMM-EM convergence]

Dataset3

[animations: K-means and Weighted K-means convergence]
[animation: GMM-EM convergence]

Dataset4

[animations: K-means and Weighted K-means convergence]

[animation: GMM-EM convergence]

Dataset5

[animations: K-means, Weighted K-means, and GMM-EM convergence]
Dataset6

[animations: K-means and Weighted K-means convergence]

[animation: GMM-EM convergence]

Dataset7

[animations: K-means and Weighted K-means convergence]

[animation: GMM-EM convergence]

Dataset8

[animation: K-means convergence]
The next animation shows the results with weighted K-means with stiffness parameter beta=10:
[animation: Weighted K-means, beta=10]

The next animation shows the results with weighted K-means with stiffness parameter beta=1:
[animation: Weighted K-means, beta=1]

[animation: GMM-EM convergence]

Here is how the log-likelihood for EM monotonically increases over the iterations for a particular dataset:
[figure: EM log-likelihood vs. iteration]
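
As a quick illustrative check (assuming the gmm_em sketch above is in scope, and using synthetic data rather than the dataset shown in the figure), the stored log-likelihood trace can be plotted and verified to be non-decreasing:

```python
import numpy as np
import matplotlib.pyplot as plt

# synthetic 2-D data with three well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 4])])

# gmm_em is the EM sketch defined earlier; it returns the per-iteration log-likelihoods
_, _, _, _, log_liks = gmm_em(X, K=3, n_iter=50)
assert all(b >= a - 1e-8 for a, b in zip(log_liks, log_liks[1:]))  # EM never decreases it

plt.plot(log_liks, marker='o')
plt.xlabel('EM iteration')
plt.ylabel('log-likelihood')
plt.show()
```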

The following animations show GMM EM convergence on a few more datasets:

[animations: GMM-EM convergence on several more datasets]
