CS6140 Machine Learning
HW7 - Local Methods and Clustering
Make sure you check the syllabus for the due date.
PROBLEM 1 [50 points] Local Classification (notes, slides)
For the Spambase
dataset, use local classification on the test set (10-fold cross
validation). The principle is: predict a label for each test point
based on its neighbours within a window. For this to work, you first
have to decide on a similarity/kernel function between datapoints
K(x,z) - for example Gaussian,
Laplacian, identity, etc.
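For reference, a minimal sketch of two such kernels in Python/NumPy (the bandwidth gamma is a placeholder parameter you would tune, not a prescribed value):

import numpy as np

def gaussian_kernel(x, z, gamma=0.05):
    # K(x,z) = exp(-gamma * ||x - z||^2); gamma is a hypothetical bandwidth
    return np.exp(-gamma * np.sum((x - z) ** 2))

def laplacian_kernel(x, z, gamma=0.05):
    # K(x,z) = exp(-gamma * ||x - z||_1)
    return np.exp(-gamma * np.sum(np.abs(x - z)))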
A) Fixed Window. Fix an appropriate window size W around the test
datapoint, and predict the label as the majority or average of
training neighbours within the window.
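A possible sketch, assuming NumPy arrays and +/-1 labels (the empty-window fallback below is an arbitrary choice):

def window_predict(x, X_train, y_train, W):
    # Labels of all training points within Euclidean distance W of x
    dists = np.linalg.norm(X_train - x, axis=1)
    inside = y_train[dists <= W]
    if inside.size == 0:
        return -1  # arbitrary fallback when no neighbour falls in the window
    # Majority vote for +/-1 labels (a tie gives 0; break it as you prefer)
    return np.sign(inside.mean())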
B) K-Nearest Neighbours: Pick K (for example K=9)
and predict the label as the average or majority
among the closest K training neighbours.
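A corresponding sketch under the same assumptions:

def knn_predict(x, X_train, y_train, K=9):
    # Take the K training points closest to x and vote
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    return np.sign(y_train[nearest].mean())  # odd K avoids ties for +/-1 labels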
C) Kernel density estimation. Separately for each label class C
(+/-), estimate P(x|C) using the density estimator given by the
kernel K restricted to the training points from class C (no need for
explicit windows, as the kernel weights each point by
similarity). Then predict the class with the largest Bayesian
probability P(C|x) = P(C) * P(x|C) / P(x).
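One way to sketch this, reusing a kernel such as gaussian_kernel above (P(x) is a common factor across classes, so it can be dropped from the argmax):

def kde_bayes_predict(x, X_train, y_train, kernel):
    best_c, best_score = None, -np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        # Density estimate P(x|C): average kernel weight over class-C points
        p_x_given_c = np.mean([kernel(x, z) for z in Xc])
        score = (len(Xc) / len(X_train)) * p_x_given_c  # P(C) * P(x|C)
        if score > best_score:
            best_c, best_score = c, score
    return best_c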
PROBLEM 2 Clustering
20Newsgroups dataset [Extra Credit]
First, decide on a similarity measure for the 20Newsgroups dataset. Use entropy
and purity for evaluation (against the true class labels). For this problem, to get 50 points, you need
to make clustering work for one of A), B), C) below, not all
three.
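For the evaluation, a minimal sketch of cluster purity and entropy, assuming integer class labels 0..C-1 and integer cluster ids:

import numpy as np

def purity_and_entropy(clusters, labels):
    n = len(labels)
    purity, entropy = 0.0, 0.0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        counts = np.bincount(members)
        p = counts[counts > 0] / members.size
        purity += counts.max() / n                               # size-weighted purity
        entropy -= (members.size / n) * (p * np.log2(p)).sum()   # size-weighted entropy
    return purity, entropy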
A) Use K-Means clustering on the 20Newsgroups datapoints. You can
choose K appropriately in advance (knowing the true number of
classes), but you'll have to experiment with the initial centroids.
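A bare-bones Lloyd's-iteration sketch (Euclidean distance; for text data a cosine-based measure may fit better, and the random initialization below is exactly the part you will want to vary):

import numpy as np

def kmeans(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point
        # (this materializes an (n, K, d) array; for large data,
        # compute distances one centroid at a time instead)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members
        for k in range(K):
            if (assign == k).any():
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids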
B) Use hierarchical clustering on the 20Newsgroups dataset,
with average-link distance between clusters.
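A sketch using SciPy's agglomerative clustering (the cosine metric here is an assumption; use whatever matches your similarity measure):

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def average_link(X, K):
    # Average-link agglomeration, then cut the dendrogram into K flat clusters
    # (pdist needs O(n^2) memory, so a subsample may be necessary)
    Z = linkage(pdist(X, metric='cosine'), method='average')
    return fcluster(Z, t=K, criterion='maxclust')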
C) [Extra Credit] Apply K-medoids, where the cluster centers
are not average centroids, but actual "middle" datapoints (medoids).
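Relative to the K-Means sketch above, only the update step changes; a sketch of a medoid update you could swap in for the centroid mean (assumes no cluster is empty):

def medoid_update(X, assign, K):
    # The medoid minimizes the summed distance to the other cluster members
    medoids = []
    for k in range(K):
        members = X[assign == k]
        d = ((members[:, None, :] - members[None, :, :]) ** 2).sum(axis=2)
        medoids.append(members[d.sum(axis=1).argmin()])
    return np.array(medoids)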
PROBLEM 3 Clustering images [Extra Credit]
Same as problem 2 (both K-Means and hierarchical clustering), on the
digits dataset (training data,
labels; testing data, labels). You can choose K=10
in advance.
PROBLEM 4 Clustering with EM [Extra Credit]
Clustering: Apply the EM algorithm on the entire
Spambase
data trainset (both positive and negative) using a mixture of
K=9 Gaussians, each Gaussian 57-dimensional. For each cluster
(= Gaussian component), consider the training datapoints most
associated with it (highest probability in the mixture, out of the K
probabilities), and take the majority among these points,
positive or negative - this majority is the cluster's prediction label.
Use the following testing schema: assign each
test datapoint to the cluster given by the highest-probability
Gaussian component, and use that cluster's label to make the
prediction. Present the accuracy.
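A sketch with scikit-learn's GaussianMixture, assuming X_train, y_train, X_test, y_test are already loaded as NumPy arrays with 0/1 labels, and that no component ends up with zero hard-assigned points:

import numpy as np
from sklearn.mixture import GaussianMixture

K = 9
gmm = GaussianMixture(n_components=K, covariance_type='full').fit(X_train)

# Label each component by the majority class of its hard-assigned training points
hard = gmm.predict(X_train)
cluster_label = np.array([1 if y_train[hard == k].mean() > 0.5 else 0
                          for k in range(K)])

# Hard-assignment testing: label of the most probable component
accuracy = (cluster_label[gmm.predict(X_test)] == y_test).mean()
print('accuracy:', accuracy)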
Alternative testing: use the cluster labels to
produce a mixture-score prediction, weighting each
component by its probability:
F(x) = label_cluster1 * P(x|cluster1) + label_cluster2 * P(x|cluster2) + ... + label_clusterK * P(x|clusterK).
Present the AUC.
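A sketch continuing the code above, evaluating the component densities P(x|cluster k) directly with SciPy (note these densities can underflow in 57 dimensions, so a log-space version may be safer in practice):

from scipy.stats import multivariate_normal
from sklearn.metrics import roc_auc_score

signed = 2 * cluster_label - 1  # map labels {0,1} to {-1,+1}
lik = np.column_stack([
    multivariate_normal.pdf(X_test, mean=gmm.means_[k], cov=gmm.covariances_[k])
    for k in range(K)
])
scores = lik @ signed  # F(x) = sum_k label_k * P(x|cluster k)
print('AUC:', roc_auc_score(y_test, scores))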
PROBLEM 5 [Extra Credit]
Prove the harmonic-function property discussed in class, based
on this
paper. Specifically, prove that to minimize the energy function
E(f) = (1/2) * sum_{i,j} w_ij * (f(i) - f(j))^2,
f must be harmonic, i.e. for every unlabeled datapoint j it must
satisfy
f(j) = (1/d_j) * sum_i w_ij * f(i), where d_j = sum_i w_ij.
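One possible route, sketched in LaTeX (w is assumed symmetric; only the unlabeled entries of f are free, since the labeled ones are clamped to their given values):

\[
E(f) = \frac{1}{2}\sum_{i,j} w_{ij}\,\bigl(f(i)-f(j)\bigr)^2
\quad\Longrightarrow\quad
\frac{\partial E}{\partial f(j)} = 2\sum_{i} w_{ij}\,\bigl(f(j)-f(i)\bigr) = 0
\quad\Longrightarrow\quad
f(j) = \frac{1}{d_j}\sum_{i} w_{ij}\,f(i).
\]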