Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.
We are not looking for very long answers (if you find yourself writing more than one or two pages of typed text per problem, you are probably on the wrong track). Try to be concise; also keep in mind that good ideas and explanations matter more than exact details.
Consider the following neural network (left graph), with 8 input units (for data with 8 features), 3 hidden units and 8 output units, and assume the nonlinear functions are all sigmoid.
a) The 8 training data inputs are identical to the outputs, as shown in the table on the right. Implement this network and the backpropagation algorithm to compute all the network weights; you should initialize the weights with nontrivial values (i.e., not values that already minimize the error).
HINT: on the trained network, you should obtain values for the hidden units somewhat similar to the ones shown in the table (up to symmetry). Feel free to make changes to the algorithm that suit your implementation, but briefly document them.
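For part a), here is a minimal NumPy sketch of the network and plain gradient-descent backpropagation; the learning rate, iteration count, and initialization range are illustrative choices, not requirements:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The 8 training inputs are the one-hot patterns, with targets equal
# to the inputs (auto-association).
X = np.eye(8)
Y = X.copy()

rng = np.random.default_rng(0)
# Nontrivial random initialization (small values, not a minimum of the error).
W1 = rng.uniform(-0.5, 0.5, size=(8, 3))   # input -> hidden weights
b1 = np.zeros(3)
W2 = rng.uniform(-0.5, 0.5, size=(3, 8))   # hidden -> output weights
b2 = np.zeros(8)

eta = 0.5  # learning rate (illustrative)
for epoch in range(20000):
    # Forward pass
    H = sigmoid(X @ W1 + b1)        # hidden activations, shape (8, 3)
    O = sigmoid(H @ W2 + b2)        # output activations, shape (8, 8)

    # Backward pass (squared-error loss, sigmoid derivatives)
    delta_out = (O - Y) * O * (1 - O)             # shape (8, 8)
    delta_hid = (delta_out @ W2.T) * H * (1 - H)  # shape (8, 3)

    # Gradient-descent updates
    W2 -= eta * H.T @ delta_out
    b2 -= eta * delta_out.sum(axis=0)
    W1 -= eta * X.T @ delta_hid
    b1 -= eta * delta_hid.sum(axis=0)

# After training, the hidden activations tend toward roughly binary codes
# (up to symmetry), since 3 units suffice to encode 8 distinct patterns.
print(np.round(sigmoid(X @ W1 + b1), 2))
```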
b) Since the outputs and inputs are identical for each datapoint, one can view this network as an encoder-decoder mechanism (this is one of the uses of neural networks). In this context, explain the purpose of the training algorithm. (I expect a rather nontechnical, but documented, answer.)
c) Can this encoder-decoder scheme work with 1 hidden unit? Can it work with 2 hidden units? What if there are more than 8 inputs and outputs? Justify your answers mathematically.
Submit a concise report.
The Spambase data set consists of 4,601 e-mails, of which 1,813 are spam (39.4%). The data set archive contains a processed version of the e-mails wherein 57 real-valued features have been extracted and the spam/non-spam label has been assigned. You should work with this processed version of the data. The data set archive contains a description of the features extracted as well as some simple statistics over those features.
To estimate the generalization (testing) error of your classifier, you will perform cross-validation. In k-fold cross-validation, one would ordinarily partition the data set randomly into k groups of roughly equal size and perform k experiments (the "folds") wherein a model is trained on k-1 of the groups and tested on the remaining group, where each group is used for testing exactly once. (A common value of k is 10.) The generalization error of the classifier is estimated by the average of the performance across all k folds.
While one should perform cross-validation with random partitions, for consistency and comparability of your results, you should partition the data into 10 groups as follows: Consider the 4,601 data points in the order they appear in the processed data file. Each group will consist of every 10th data point in file order, i.e., Group 1 will consist of points {1,11,21,...}, Group 2 will consist of points {2,12,22,...}, ..., and Group 10 will consist of points {10,20,30,...}. Finally, Fold k will consist of testing on Group k a model obtained by training on the remaining nine groups.
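For instance, the deterministic partition above takes only a few lines of Python (a sketch; function and variable names here are mine, not prescribed):

```python
# Deterministic 10-way split: 1-indexed point i goes to group ((i - 1) % 10) + 1.
def make_folds(data, k=10):
    groups = [data[g::k] for g in range(k)]   # group g+1 = points {g+1, g+1+k, ...}
    folds = []
    for g in range(k):
        test = groups[g]
        train = [x for h in range(k) if h != g for x in groups[h]]
        folds.append((train, test))
    return folds
```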
The 57 features are real-valued, and one can model the feature distributions in simple and complex ways. You will explore the effect of this modeling on overall performance.
and use these estimated values in your Naive Bayes predictor, as appropriate. (I would suggest using Laplace or Laplace-like smoothing, as sketched below, to avoid any issues with zero probabilities, and feel free to experiment with various threshold values, if you like.)
and estimate the class conditional probabilities for each of these bins, in a manner similar to that used in the Bernoulli model above. (Again, use Laplace smoothing, as in the sketch below, to avoid any issues with zero probabilities, and feel free to experiment with the number of bins and various threshold values, if you like.)
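For concreteness, here is a minimal sketch of both estimation schemes, assuming the Bernoulli model thresholds each feature at its overall mean and the histogram model uses equal-width bins; all names and parameter choices are illustrative, not prescribed:

```python
import numpy as np

def bernoulli_params(X, y):
    """Estimate P(feature_j > mean_j | class) with Laplace (add-one) smoothing.

    X: (n, 57) feature matrix; y: (n,) labels with 1 = spam, 0 = non-spam.
    """
    mu = X.mean(axis=0)           # per-feature thresholds (assumed: overall means)
    B = (X > mu).astype(int)      # binarize each feature against its threshold
    params = {}
    for c in (0, 1):
        Bc = B[y == c]
        # Laplace smoothing: (count + 1) / (n_c + 2) keeps estimates in (0, 1).
        params[c] = (Bc.sum(axis=0) + 1.0) / (len(Bc) + 2.0)
    return mu, params

def histogram_params(X, y, n_bins=4):
    """Estimate P(feature_j in bin b | class) with Laplace smoothing."""
    n, d = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Interior edges of n_bins equal-width bins per feature (illustrative choice).
    edges = [np.linspace(lo[j], hi[j], n_bins + 1)[1:-1] for j in range(d)]
    params = {}
    for c in (0, 1):
        Xc = X[y == c]
        counts = np.zeros((d, n_bins))
        for j in range(d):
            bins = np.digitize(Xc[:, j], edges[j])   # bin index in 0..n_bins-1
            counts[j] = np.bincount(bins, minlength=n_bins)
        params[c] = (counts + 1.0) / (len(Xc) + n_bins)   # Laplace smoothing
    return edges, params
```

At prediction time, the relevant class-conditional probability is looked up per feature and the log-probabilities are summed, per the usual Naive Bayes independence assumption.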
When using Naive Bayes, one can easily make such trade-offs. For example, in the usual Bayes formulation, one would predict "spam" if

$$\Pr[\text{spam} \mid x] > \Pr[\text{non-spam} \mid x],$$

or equivalently, in a log-odds formulation,

$$\log \frac{\Pr[\text{spam} \mid x]}{\Pr[\text{non-spam} \mid x]} > 0.$$

Note that one could choose to classify an e-mail as spam for any threshold $\tau$,

$$\log \frac{\Pr[\text{spam} \mid x]}{\Pr[\text{non-spam} \mid x]} > \tau,$$

where positive values of $\tau$ effectively reduce the number of spam classifications, thus decreasing false positives at the expense of increasing false negatives, while negative values of $\tau$ have the opposite effect.
One mechanism for visualizing this trade-off is through an ROC curve. (See, for example, the Wikipedia page on ROC curves.) An ROC curve effectively visualizes the true positive (detection) rate vs. the false positive (false alarm) rate for all possible thresholds. (The true positive rate is the fraction of spam messages above threshold; the false positive rate is the fraction of non-spam messages above threshold. Note that the true positive rate is 1 minus the false negative rate.)
One simple way to generate an ROC curve is to sort all m testing points by their log-odds and, for each k from 1 to m, calculate the true positive and false positive rates associated with the top k items; plot these points.
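A sketch of this procedure in Python (the log-odds scores are assumed to come from your classifier):

```python
import numpy as np

def roc_points(scores, labels):
    """Compute ROC points by sweeping the threshold down the sorted scores.

    scores: (m,) log-odds per test point; labels: (m,) 1 = spam, 0 = non-spam.
    Returns lists of false positive and true positive rates, starting at (0, 0).
    """
    order = np.argsort(-scores)           # descending by log-odds
    labels = np.asarray(labels)[order]
    P = labels.sum()                      # number of spam messages
    N = len(labels) - P                   # number of non-spam messages
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for lab in labels:                    # take the top k points, k = 1..m
        tp += lab
        fp += 1 - lab
        tpr.append(tp / P)
        fpr.append(fp / N)
    return fpr, tpr                       # ends at (1, 1)
```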
Create an ROC curve for each of your classifiers for Fold 1. Preferably, draw all three curves on the same plot so that you can compare them.
The AUC can be calculated fairly simply using the trapezoidal rule for estimating the area under a curve: If our m ROC data points are

$$(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$$

in order, where $(x_1, y_1) = (0,0)$ and $(x_m, y_m) = (1,1)$, then the AUC is calculated as follows:

$$\mathrm{AUC} = \frac{1}{2} \sum_{k=2}^{m} (x_k - x_{k-1})(y_k + y_{k-1}).$$
Calculate the AUC of your three classifiers.
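The trapezoidal rule transcribes directly into code (using the fpr/tpr lists from the ROC sketch above):

```python
def auc(fpr, tpr):
    """Trapezoidal-rule area under the ROC curve."""
    return 0.5 * sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1])
                     for k in range(1, len(fpr)))
```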
a) Prove that
a) DHS Ch. 2, Problem 45
Read Prof. Andrew Ng's lecture on ML practice advice. Write a brief summary (one page) explaining the quantities in the lecture and the advice.
For a function $f(x_1, x_2, \ldots, x_n)$ with real values, the "Hessian" is the matrix of partial second derivatives,

$$H_{ij} = \frac{\partial^2 f}{\partial x_i \, \partial x_j}.$$
Consider the log-likelihood function for logistic regression,

$$\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right],$$

where $h_\theta(x) = 1/(1 + e^{-\theta^T x})$ is the sigmoid hypothesis. Show that its Hessian matrix $H$ is negative semidefinite, i.e., that for any vector $z$ it satisfies

$$z^T H z \le 0.$$

Remark: This fact is sometimes written $H \preceq 0$, and it implies that the log-likelihood function is concave.
Hint:
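A useful starting observation, assuming the standard setup above: for any vectors $x, z \in \mathbb{R}^n$,

$$\sum_{i}\sum_{j} z_i x_i x_j z_j = \left( x^T z \right)^2 \ge 0.$$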