CS6140 Machine Learning
HW7 - Local Methods and Clustering
Make sure you check the syllabus for the due date.
PROBLEM 1 [50 points] Local Classification (notes, slides)
For the Spambase
dataset, use local classification on the test set (10-fold cross
validation). The principle is: predict a label for each test point
based on its neighbours within a window. For this to work, you first
have to decide on a similarity/kernel function between datapoints
K(x,z) - for example Gaussian,
Laplacian, identity, etc.
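For reference, a minimal sketch of two such kernels in Python/NumPy (the bandwidth gamma is a placeholder parameter you would tune, not a prescribed value):

import numpy as np

def gaussian_kernel(x, z, gamma=0.05):
    # K(x,z) = exp(-gamma * ||x - z||^2); gamma is a hypothetical bandwidth
    return np.exp(-gamma * np.sum((x - z) ** 2))

def laplacian_kernel(x, z, gamma=0.05):
    # K(x,z) = exp(-gamma * ||x - z||_1)
    return np.exp(-gamma * np.sum(np.abs(x - z)))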
A) Fixed Window. Fix an appropriate window size W around the test
datapoint, and predict the label as the majority or average of
training neighbours within the window.
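A possible sketch, assuming NumPy arrays and +/-1 labels (the empty-window fallback below is an arbitrary choice):

def window_predict(x, X_train, y_train, W):
    # Labels of all training points within Euclidean distance W of x
    dists = np.linalg.norm(X_train - x, axis=1)
    inside = y_train[dists <= W]
    if inside.size == 0:
        return -1  # arbitrary fallback when no neighbour falls in the window
    # Majority vote for +/-1 labels (a tie gives 0; break it as you prefer)
    return np.sign(inside.mean())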
B) K-Nearest Neighbours: Pick K (for example K=9)
and predict the label as the average or majority
among the closest K training neighbours.
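A corresponding sketch under the same assumptions:

def knn_predict(x, X_train, y_train, K=9):
    # Take the K training points closest to x and vote
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:K]
    return np.sign(y_train[nearest].mean())  # odd K avoids ties for +/-1 labels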
C) Kernel density estimation. Separately for each label class C
(+/-), estimate P(x|C) using the density estimator given by the
kernel K restricted to the training points from class C (no need for
explicit windows, as the kernel weights each point by
similarity). Then predict the class with the largest Bayesian
probability P(C|x) = P(C) * P(x|C) / P(x).
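One way to sketch this, reusing a kernel such as gaussian_kernel above (P(x) is a common factor across classes, so it can be dropped from the argmax):

def kde_bayes_predict(x, X_train, y_train, kernel):
    best_c, best_score = None, -np.inf
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        # Density estimate P(x|C): average kernel weight over class-C points
        p_x_given_c = np.mean([kernel(x, z) for z in Xc])
        score = (len(Xc) / len(X_train)) * p_x_given_c  # P(C) * P(x|C)
        if score > best_score:
            best_c, best_score = c, score
    return best_c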
PROBLEM 2 Clustering
20Newsgroups dataset [Extra Credit]
First, decide on a similarity measure for the 20Newsgroups dataset. Use entropy
and purity for evaluation (against the true class labels). For this problem, to get 50 points, you need
to make clustering work for one of A), B), C) below, not all
three.
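For the evaluation, a minimal sketch of cluster purity and entropy, assuming integer class labels 0..C-1 and integer cluster ids:

import numpy as np

def purity_and_entropy(clusters, labels):
    n = len(labels)
    purity, entropy = 0.0, 0.0
    for c in np.unique(clusters):
        members = labels[clusters == c]
        counts = np.bincount(members)
        p = counts[counts > 0] / members.size
        purity += counts.max() / n                               # size-weighted purity
        entropy -= (members.size / n) * (p * np.log2(p)).sum()   # size-weighted entropy
    return purity, entropy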
A) Use K-Means clustering on the 20Newsgroups datapoints. You can
choose K appropriately in advance (knowing the true number of
classes), but you'll have to experiment with the initial centroids.
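A bare-bones Lloyd's-iteration sketch (Euclidean distance; for text data a cosine-based measure may fit better, and the random initialization below is exactly the part you will want to vary):

import numpy as np

def kmeans(X, K, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point
        # (this materializes an (n, K, d) array; for large data,
        # compute distances one centroid at a time instead)
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members
        for k in range(K):
            if (assign == k).any():
                centroids[k] = X[assign == k].mean(axis=0)
    return assign, centroids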
B) Use hierarchical clustering on the 20Newsgroups dataset,
with average-link distance between clusters.
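A sketch using SciPy's agglomerative clustering (the cosine metric here is an assumption; use whatever matches your similarity measure):

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def average_link(X, K):
    # Average-link agglomeration, then cut the dendrogram into K flat clusters
    # (pdist needs O(n^2) memory, so a subsample may be necessary)
    Z = linkage(pdist(X, metric='cosine'), method='average')
    return fcluster(Z, t=K, criterion='maxclust')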
C) [Extra Credit] Apply K-medoids, where the cluster centers
are not average centroids, but actual "middle" datapoints (medoids).
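Relative to the K-Means sketch above, only the update step changes; a sketch of a medoid update you could swap in for the centroid mean (assumes no cluster is empty):

def medoid_update(X, assign, K):
    # The medoid minimizes the summed distance to the other cluster members
    medoids = []
    for k in range(K):
        members = X[assign == k]
        d = ((members[:, None, :] - members[None, :, :]) ** 2).sum(axis=2)
        medoids.append(members[d.sum(axis=1).argmin()])
    return np.array(medoids)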
PROBLEM 3 Clustering images [Extra Credit]
Same as problem 2 (both K-Means and hierarchical clustering), on the
digits dataset (training data,
labels; testing data, labels). You can choose K=10
in advance.
PROBLEM 4 Clustering with EM [Extra Credit]
Clustering: Apply the EM algorithm on the entire
Spambase
data trainset (both positive and negative) using a mixture of
K=9 Gaussians, each Gaussian 57-dimensional. For each cluster
(= Gaussian component), consider the training datapoints most
associated with it (highest probability in the mixture, out of the K
probabilities), and take the majority among these points,
positive or negative - this majority is the cluster's prediction label.
Use the following testing schema: assign each
test datapoint to the cluster given by the highest-probability
Gaussian component, and use that cluster's label to make the
prediction. Present the accuracy.
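A sketch with scikit-learn's GaussianMixture, assuming X_train, y_train, X_test, y_test are already loaded as NumPy arrays with 0/1 labels, and that no component ends up with zero hard-assigned points:

import numpy as np
from sklearn.mixture import GaussianMixture

K = 9
gmm = GaussianMixture(n_components=K, covariance_type='full').fit(X_train)

# Label each component by the majority class of its hard-assigned training points
hard = gmm.predict(X_train)
cluster_label = np.array([1 if y_train[hard == k].mean() > 0.5 else 0
                          for k in range(K)])

# Hard-assignment testing: label of the most probable component
accuracy = (cluster_label[gmm.predict(X_test)] == y_test).mean()
print('accuracy:', accuracy)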
Alternative testing: use the cluster labels to
produce a mixture-score prediction, weighting each
component by its probability:
F(x) = label_cluster1 * P(x|cluster1) + label_cluster2 * P(x|cluster2) + ... + label_clusterK * P(x|clusterK).
Present the AUC.
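A sketch continuing the code above, evaluating the component densities P(x|cluster k) directly with SciPy (note these densities can underflow in 57 dimensions, so a log-space version may be safer in practice):

from scipy.stats import multivariate_normal
from sklearn.metrics import roc_auc_score

signed = 2 * cluster_label - 1  # map labels {0,1} to {-1,+1}
lik = np.column_stack([
    multivariate_normal.pdf(X_test, mean=gmm.means_[k], cov=gmm.covariances_[k])
    for k in range(K)
])
scores = lik @ signed  # F(x) = sum_k label_k * P(x|cluster k)
print('AUC:', roc_auc_score(y_test, scores))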
PROBLEM 5 [Extra Credit]
Prove the harmonic-function property discussed in class, based
on this
paper. Specifically, prove that to minimize the energy function
E(f) = (1/2) * sum_{i,j} w_ij * (f(i) - f(j))^2,
f must be harmonic, i.e. for every unlabeled datapoint j it must
satisfy
f(j) = (1/d_j) * sum_i w_ij * f(i), where d_j = sum_i w_ij.
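One possible route, sketched in LaTeX (w is assumed symmetric; only the unlabeled entries of f are free, since the labeled ones are clamped to their given values):

\[
E(f) = \frac{1}{2}\sum_{i,j} w_{ij}\,\bigl(f(i)-f(j)\bigr)^2
\quad\Longrightarrow\quad
\frac{\partial E}{\partial f(j)} = 2\sum_{i} w_{ij}\,\bigl(f(j)-f(i)\bigr) = 0
\quad\Longrightarrow\quad
f(j) = \frac{1}{d_j}\sum_{i} w_{ij}\,f(i).
\]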