CS6200 Information Retrieval
Homework 6: Spam Classifier - Extra Credit


Objective

Build a Spam Classifier using Machine Learning and ElasticSearch.

Data

Consider the trec07_spam set of documents annotated for spam, available under “data resources”. Index the documents with ElasticSearch, making sure each document has a field “spam” with the value “yes” or “no”.
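Below is a minimal indexing sketch using the official elasticsearch Python client. The index name “trec07_spam”, the field names “text” and “spam”, the label-file location, and its line format are assumptions for illustration, not requirements of the assignment.

    # Sketch: index trec07 messages with a "spam" field ("yes"/"no").
    import os
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def actions(labels_path="trec07p/full/index"):
        # Each line of the label file is assumed to look like: "spam ../data/inmail.1"
        base = os.path.dirname(labels_path)
        with open(labels_path) as f:
            for i, line in enumerate(f):
                label, rel_path = line.split()
                msg_path = os.path.normpath(os.path.join(base, rel_path))
                with open(msg_path, encoding="latin-1", errors="ignore") as msg:
                    text = msg.read()
                yield {
                    "_index": "trec07_spam",
                    "_id": i,
                    "_source": {"text": text, "spam": "yes" if label == "spam" else "no"},
                }

    helpers.bulk(es, actions())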

All of these documents will be used for training. You are encouraged to look for other annotated spam datasets and add them to the index, in order to have a larger training corpus.

Spam Features

Manually create a list of ngrams (unigrams, bigrams, trigrams, etc.) that you think are related to spam, for example: “free”, “win”, “porn”, “click here”, etc. These ngrams will be the features (columns) of the data matrix.

You will have to use ElasticSearch's querying functionality to create a feature value for each (document, feature) pair. There are ways to ask ES for all matches (and hence all feature values) of a given ngram in a single query, so you don't have to query each (ngram, doc) pair separately.
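One possible way to do this, sketched below, is a match_phrase query per feature whose hit scores become that feature's values (documents that do not match get 0). The index name, field name, and the 10,000-hit cap are assumptions; a larger index would need scroll or search_after.

    # Sketch: one ES query per spam feature; each hit's retrieval score becomes
    # that document's value for the feature.
    from collections import defaultdict
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    FEATURES = ["free", "win", "porn", "click here"]      # your manual ngram list

    def feature_matrix(index="trec07_spam", field="text", max_hits=10000):
        matrix = defaultdict(dict)                         # doc_id -> {feature: score}
        for feat in FEATURES:
            resp = es.search(index=index, body={
                "query": {"match_phrase": {field: feat}},
                "size": max_hits,
                "_source": False,
            })
            for hit in resp["hits"]["hits"]:
                matrix[hit["_id"]][feat] = hit["_score"]
        return matrix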

Train a learning algorithm

The label (also called the outcome or target) is the spam annotation “yes”/“no”, which you can replace with 1/0.

Using the static training data matrix, train a learner to compute a model relating the labels to the features. You can use a learning library such as SciPy/NumPy, C4.5, Weka, LibLinear, SVM Light, etc. The easiest models are linear regression and decision trees.
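As one concrete option (a sketch assuming scikit-learn and the FEATURES / feature_matrix() names from the previous sketch), a decision tree can be trained as below; spam_labels is a hypothetical dict mapping each document id to 1 (spam) or 0 (ham), built from the “spam” field.

    # Sketch: build a dense (docs x features) matrix and fit a decision tree.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def to_matrix(feat_dicts, doc_ids, features):
        # Missing (doc, feature) pairs get a value of 0.
        X = np.zeros((len(doc_ids), len(features)))
        for i, d in enumerate(doc_ids):
            for j, f in enumerate(features):
                X[i, j] = feat_dicts.get(d, {}).get(f, 0.0)
        return X

    train_feats = feature_matrix()        # from the previous sketch
    doc_ids = sorted(train_feats)         # note: docs matching no feature are omitted here
    X_train = to_matrix(train_feats, doc_ids, FEATURES)
    y_train = np.array([spam_labels[d] for d in doc_ids])   # hypothetical doc_id -> 1/0 dict

    model = DecisionTreeClassifier(max_depth=10).fit(X_train, y_train)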

Test the spam model

Test the model on your crawl from HW3. You will have to create a testing data matrix with feature values in exactly the same way as you created the training matrix: use ElasticSearch to query for your features and use the scores as feature values. A brief sketch of these steps appears after the list below.

  1. Run the model to obtain scores
  2. Treat the scores as coming from an IR function, and rank the documents
  3. Display the first few “spam” documents and visually inspect them. You should have these ready for the demo. IMPORTANT: since these documents are likely to be spam, turn off JavaScript execution if you display them in a browser, to protect your computer.
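A minimal testing/ranking sketch, reusing model, FEATURES, feature_matrix(), and to_matrix() from the sketches above; “hw3_crawl” is a placeholder for whatever index holds your HW3 crawl.

    # Sketch: score the HW3 crawl with the trained model and rank by spam score.
    test_feats = feature_matrix(index="hw3_crawl", field="text")
    test_ids = sorted(test_feats)
    X_test = to_matrix(test_feats, test_ids, FEATURES)

    # For a classifier, use the probability of the spam class as the ranking score.
    scores = model.predict_proba(X_test)[:, 1]
    ranked = sorted(zip(test_ids, scores), key=lambda p: p[1], reverse=True)
    for doc_id, score in ranked[:20]:
        print(f"{score:.4f}\t{doc_id}")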

EC

Figure out how to extract the spam feature list as unigrams and bigrams automatically from the training set, instead of manually creating it.
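One possible approach (a sketch, not the required solution) is to let ElasticSearch surface terms that are unusually frequent in spam-labeled documents via the significant_text aggregation. The index and field names and the sampler size are assumptions; bigrams would additionally need a shingle-analyzed field.

    # Sketch: ask ES for terms that are significantly more common in spam documents.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    resp = es.search(index="trec07_spam", body={
        "query": {"match": {"spam": "yes"}},
        "size": 0,
        "aggs": {
            "sample": {                               # sampler keeps the aggregation cheap
                "sampler": {"shard_size": 500},
                "aggs": {
                    "spam_terms": {"significant_text": {"field": "text", "size": 50}}
                }
            }
        },
    })
    auto_features = [b["key"] for b in resp["aggregations"]["sample"]["spam_terms"]["buckets"]]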

Rubric

20 points: A reasonable list of spam features
30 points: ES queries for feature values, training
10 points: ES queries for feature values, testing
20 points: Training the model
20 points: Run the model on testing data and display the top spam documents