Make sure you check the syllabus for the due date.
PROBLEM 1 [40 points for part C] k-Nearest Neighbors (kNN)
Implement the kNN classification algorithm such that it can work with different distance functions (or kernels as similarities).
A optional) Fix values of k (the number of closest neighbors used), i.e. k=1, k=3, and k=7.
Spambase dataset: try k=1,3,7 with Euclidean distance.
Digits dataset (with extracted features): try k=1,3,7 with the following similarities/distances: cosine distance, Gaussian kernel, polynomial degree-2 kernel.
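Below is a minimal Python sketch of how a pluggable-distance kNN might be structured; the function names, the similarity flag, and the default kernel parameters (sigma, the polynomial constant c) are our own choices, not part of the assignment.

    import numpy as np

    # Distance / similarity functions: for distances, smaller means closer;
    # for kernels (similarities), larger means closer.
    def euclidean(a, b):
        return np.linalg.norm(a - b)

    def cosine_distance(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def gaussian_kernel(a, b, sigma=1.0):
        return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

    def poly2_kernel(a, b, c=1.0):
        return (np.dot(a, b) + c) ** 2

    def knn_predict(X_train, y_train, z, k=3, dist=euclidean, similarity=False):
        # Score every training point, sort (descending for similarities,
        # ascending for distances), and take a majority vote over the top k.
        scores = np.array([dist(x, z) for x in X_train])
        order = np.argsort(-scores) if similarity else np.argsort(scores)
        top_labels = np.asarray(y_train)[order[:k]]
        values, counts = np.unique(top_labels, return_counts=True)
        return values[np.argmax(counts)]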
B optional) Fixed window. Instead of using the closest k neighbors, we now use all neighbors within a window. It should be clear that a test point z1 might have many neighbors in its window (a dense area), while another test point z2 might have only a few (a sparse area) or none at all.
Fix an appropriate window size around the test data point by defining a window "radius" R, i.e. the maximum distance allowed. Predict the label as the majority (or average) of the training neighbors within the window.
Run on Spambase dataset with Euclidean distance.
Run on Digits dataset with cosine distance.
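One way to sketch the fixed-window variant, reusing a distance function such as euclidean or cosine_distance from the kNN sketch above (the choice of what to return for an empty window is ours):

    import numpy as np

    def window_predict(X_train, y_train, z, R, dist, default_label=None):
        # Majority vote over all training points within distance R of the test point z.
        # If the window is empty, fall back to default_label (e.g. the overall majority class).
        labels = [y for x, y in zip(X_train, y_train) if dist(x, z) <= R]
        if not labels:
            return default_label
        values, counts = np.unique(labels, return_counts=True)
        return values[np.argmax(counts)]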
C required) Kernel density estimation. Separately for each label class C, estimate P(z|C) using the density estimator given by the kernel K restricted to the training points from class C. There is no need for physical windows, as the kernel weights each point by similarity:
P(z|C) = (1/mC) * sum over training points xi with label C of K(z, xi),
where mC is the number of points in the training set with label C. Then predict the class with the largest Bayesian probability
P(C|z) = P(C) * P(z|C) / P(z).
Run on Spambase dataset with Gaussian kernel.
(optional) Run on Digits dataset with Gaussian kernel
(optional) Run on Digits dataset with polynomial kernel.
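A rough sketch of the kernel-density classifier described in part C, reusing gaussian_kernel from the kNN sketch above; estimating P(C) as the class frequency is our assumption about the prior:

    import numpy as np

    def kde_classify(X_train, y_train, z, kernel):
        # Predict argmax_C P(C) * P(z|C); P(z) is the same for every class and can be dropped.
        X_train, y_train = np.asarray(X_train), np.asarray(y_train)
        best_class, best_score = None, -np.inf
        for c in np.unique(y_train):
            X_c = X_train[y_train == c]
            prior = len(X_c) / len(X_train)                      # P(C) as class frequency
            likelihood = np.mean([kernel(x, z) for x in X_c])    # (1/mC) * sum_i K(z, xi)
            if prior * likelihood > best_score:
                best_class, best_score = c, prior * likelihood
        return best_class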
PROBLEM 2 [50 points] Dual Perceptron with kernels
A) Implement the dual version of the perceptron, keeping in mind that it has to work with kernels. Apply the dual perceptron to the artificial dataset we used in HW2, in order to make sure your dual is equivalent to the primal (same solution).
Note: in our implementation of the dual perceptron, we did not pre-process the data to convert all points to positive labels, as we did with the primal perceptron. Such a conversion would imply that the kernel's implicit mapping satisfies -ɸ(x) = ɸ(-x), which is not necessarily true.
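A minimal sketch of the dual (kernel) perceptron, assuming labels in {-1, +1} and a precomputed Gram matrix (function and variable names are ours):

    import numpy as np

    def dual_perceptron(X, y, kernel=np.dot, max_epochs=100):
        # Dual perceptron: one coefficient alpha_i per training point.
        # The implicit weight vector is w = sum_i alpha_i * y_i * phi(x_i),
        # so predictions only need kernel evaluations K(x_i, x).
        X, y = np.asarray(X), np.asarray(y)
        n = len(X)
        K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])  # Gram matrix
        alpha = np.zeros(n)
        for _ in range(max_epochs):
            mistakes = 0
            for t in range(n):
                if y[t] * np.sum(alpha * y * K[:, t]) <= 0:   # mistake (or on the boundary)
                    alpha[t] += 1
                    mistakes += 1
            if mistakes == 0:
                break
        return alpha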
B) We have created another artificial binary dataset, "two spirals", which is deliberately highly non-separable by a linear boundary. Run the dual perceptron with the plain dot product and conclude that the perceptron does not work; then run the dual perceptron with the Gaussian kernel and conclude that the data is now separable.
Expected accuracy with the dot-product kernel: about 50% on average, oscillating between 66% and 33%.
Expected accuracy with the RBF kernel: 99.9% (depends on sigma).
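For example, the two runs might look like the snippet below, assuming the two-spirals data has been loaded into arrays X and y with labels in {-1, +1}; load_spirals and the sigma value are placeholders, not provided helpers or required settings.

    # X, y = load_spirals()   # placeholder loader for the two-spirals dataset
    alpha_lin = dual_perceptron(X, y, kernel=np.dot)   # expected to oscillate around 50%
    alpha_rbf = dual_perceptron(X, y, kernel=lambda a, b: gaussian_kernel(a, b, sigma=0.5))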
PROBLEM 4 [Extra Credit] Dual Perceptron from Representer
Theorem
A) What is the perceptron primal loss? Write the primal perceptron as an optimization problem.
B) Apply the Representer theorem and substitute w as a linear combination of the training points in the objective, to obtain the dual form.
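As a reminder, the substitution in part B has the generic Representer-theorem form (interpreting the alphas is the point of the exercise):
w = sum_i alpha_i * ɸ(x_i), so that every term w · ɸ(x) in the objective becomes sum_i alpha_i * K(x_i, x).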
PROBLEM 5 [Extra Credit] Feature Selection with kNN
Implement the RELIEF algorithm for feature selection:
independently assess the quality W of each feature by adjusting its weight with each instance. For feature j, the update on point x is
Wj = Wj - (xj - zsame,j)^2 + (xj - zopp,j)^2,
where zsame is the training point closest to x from the same class as x, and zopp is the training point closest to x from the opposite class.
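A minimal sketch of this update loop, assuming Euclidean distance is used to find the nearest same-class and opposite-class neighbors (function names are ours):

    import numpy as np

    def relief_weights(X, y):
        # RELIEF: for each point, reward features that differ from the nearest
        # opposite-class point and penalize features that differ from the
        # nearest same-class point.
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        n, d = X.shape
        W = np.zeros(d)
        for i in range(n):
            dists = np.linalg.norm(X - X[i], axis=1)
            dists[i] = np.inf                                   # exclude the point itself
            same = (y == y[i])
            same[i] = False
            z_same = X[np.where(same, dists, np.inf).argmin()]  # nearest same-class point
            z_opp  = X[np.where(~same, dists, np.inf).argmin()] # nearest opposite-class point
            W += -(X[i] - z_same) ** 2 + (X[i] - z_opp) ** 2
        return W

    # top5 = np.argsort(-relief_weights(X_train, y_train))[:5]   # indices of the top 5 features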
Run on Spambase dataset (training) and select the top 5 features by
W. Then run the algorithm from PB1 again using only the selected features.
PROBLEM 6 [Extra Credit] Dual Regression via Representer Theorem
A) What is the loss function that linear regression minimizes? (Hint: the answer is in the notes.)
B) If we use regularized linear regression, how can we write the final weight vector in the form of Equation 2? What are the alphas in this case, and is linear regression kernelizable?
C) Compare the loss function you developed in part (B) with the Hinge and Log loss. Can we consider linear regression a max-margin learning algorithm? (Hint: plot the different loss functions, and see what happens when we try to maximize the margin for a single input data instance.)
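One quick way to act on the hint in part C is to plot the three losses as a function of the margin y * (w · x); the squared loss written in margin form below assumes labels in {-1, +1}, in which case (y - w·x)^2 = (1 - y*w·x)^2:

    import numpy as np
    import matplotlib.pyplot as plt

    m = np.linspace(-3, 3, 300)            # margin y * (w . x)
    hinge   = np.maximum(0.0, 1.0 - m)     # hinge loss
    log     = np.log(1.0 + np.exp(-m))     # logistic (log) loss
    squared = (1.0 - m) ** 2               # squared loss in terms of the margin

    for curve, name in [(hinge, "hinge"), (log, "log"), (squared, "squared")]:
        plt.plot(m, curve, label=name)
    plt.xlabel("margin y * (w . x)"); plt.ylabel("loss"); plt.legend(); plt.show()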