--- date: "2015-02-16T11:47:00-05:00" draft: true title: "Homework 5" menu: main: parent: "Homework" --- # Objective Build a Spam Classifier using Machine Learning and ElasticSearch. # Data Consider the trec07_spam set of documents annotated for spam, available "data resources". Index the documents with ElasticSearch, and make sure to have a field "spam" with values "yes" or "no" for each document. All these documents will be used as training. You are encouraged to look for other annotated spam datasets and add them to the index, in order to have a larger training corpus. # Spam Features Manually create a list of ngrams (unigrams, bigrams, trigrams, etc) that you think are related to spam. For example : "free" , "win", "porn", "click here", etc. These will be the features (columns) of the data matrix. You will have to use ElasticSearch querying functionality in order to create feature values for each document, for each feature. There are way to ask ES to give all matches (aka feature values) for a ngram, so you dont have to query (feature, doc) for all docs separately. # Train a learning algorithm The label, or outcome, or target are the spam annotation "yes" / "no" or you can replace that with 1/0. Using the "train" queries static matrix, train a learner to compute a model relating labels to the features. You can use a learning library like [SciPy/NumPy](http://www.scipy.org), [C4.5](https://github.com/scottjulian/C4.5), [Weka](http://www.cs.waikato.ac.nz/ml/weka/), [LibLinear](http://www.csie.ntu.edu.tw/~cjlin/liblinear/), [SVM Light](http://svmlight.joachims.org), etc. The easiest models are linear regression and decision trees. # Test the spam model Test the model on your crawl from HW3. You will have to create a testing data matrix with feature values in the same exact way as you created the training matrix: use ElasticSearch to query for your features, use the scores are feature values. 1. Run the model to obtain scores 2. Treat the scores as coming from an IR function, and rank the documents 3. Display first few "spam" documents and visually inspect them. You should have these ready for demo. *IMPORTANT* : Since they are likely to be spam, if you display these in a browser, you should turn off javascript execution to protect your computer. #EC Figure out how to extract the spam feature list as unigrams and bigrams automatically from the training set, instead of manually creating it. ### Rubric