This is a demo on the IMDB dataset, where documents are movie reviews. Labels are "good" or "bad" annotations for each review, derived from ratings. The purpose of a model is to predict "good" or "bad" from the review text. It is essentially the same problem as predicting "spam" or "not spam" from email text.

--------------------------------- train ---------------------------------
(pos) I highly recommend this movie.
(neg) I do not recommend this movie to anybody.
(neg) It is a waste of time.
(pos) Good fun stuff !
(neg) It's just not worth your time.
---------------------------------------------------------------------------

---------------------------------- test ----------------------------------
(neg) I do not recommend this movie unless you are prepared for the biggest waste of money and time of your life.
(neg) This movie was the slowest and most boring so called horror that I have ever seen.
(neg) The film is not worth watching.
(pos) A wonderful film
(pos) This is a really nice and sweet movie that the entire family can enjoy.
---------------------------------------------------------------------------

Gather ngrams only once, from the training set. Use these ngrams to compute matching scores for both the training set and the test set. Make sure the same ngrams are used and that their order is the same.

Procedure (sample output):

connected to index
there are 10 documents in the index
number of training documents = 5
there are 2 classes in the training set
label distribution in training set: neg:3, pos:2
LabelTranslator{intToExt={0=neg, 1=pos}, extToInt={neg=0, pos=1}}
fields to be considered = [body]
gathering 1-grams from field body with slop 0 and minDf 0
gathered 22 1-grams
gathering 2-grams from field body with slop 0 and minDf 0
gathered 21 2-grams
there are 43 ngrams in total
creating training set
allocating 43 columns for training set
training set created
data set saved to /huge1/people/chengli/projects/pyramid/archives/exp35/imdb_toy/1/train
creating test set
allocating 43 columns for test set
test set created
data set saved to /huge1/people/chengli/projects/pyramid/archives/exp35/imdb_toy/1/test

The format we want: an on-disk sparse matrix. In each line, the first number is the label; the rest are feature index:feature value pairs. Feature indices start at 0. Since the feature matrix is very sparse, only non-zero feature values are stored; any feature not listed is taken to have value 0. For example, the line "1 0:1.0 5:2.0" means label 1 (pos), feature 0 with value 1.0, feature 5 with value 2.0, and every other feature 0.

Two steps: 1. gather ngrams, 2. compute matching scores. (An end-to-end sketch of both steps appears at the end of these notes.)

Enumerating ngrams: scan all documents; for each document, pull out its term vector, get the sorted list, and scan the list.

Computing matching scores:
fundamental constraint: we cannot hold the entire dense matrix in memory.

Sparse matrix options (both are sketched below):
1. Use a sparse matrix library.
   python: scipy sparse matrix
   http://docs.scipy.org/doc/scipy/reference/sparse.html
   java: Mahout sparse matrix or Guava table
   http://mahout.apache.org/
   http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/collect/Table.html
   WARNING: be careful with the complexity of the operations.
2. Write your own data structure: an array of hash maps.
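A minimal sketch of option 1 in Python with scipy.sparse (the library linked above). The dimensions match the toy run (5 documents, 43 ngrams); the non-zero entries are made up. It also illustrates the complexity warning: lil_matrix is cheap to fill incrementally, while csr_matrix is fast for row slicing and products but expensive to modify, so build in one format and convert once.

---------------------- sketch: option 1, scipy sparse ----------------------
from scipy import sparse

n_docs, n_ngrams = 5, 43                   # matches the toy run above
X = sparse.lil_matrix((n_docs, n_ngrams))  # lil: cheap incremental writes
X[0, 3] = 1.0                              # made-up non-zero entries
X[0, 17] = 2.0
X[4, 42] = 1.0
X = X.tocsr()                              # convert once; csr is fast for row
                                           # access and products, slow to modify
print(X.nnz)                               # -> 3 stored non-zeros
-----------------------------------------------------------------------------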
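Option 2, the array of hash maps, is one hash map per document row, mapping feature index to non-zero value. Gets and sets are O(1) expected, memory is proportional to the number of non-zeros, and the row-major layout makes dumping the on-disk format above trivial. A sketch, with illustrative names (this is not an existing library class):

---------------------- sketch: option 2, array of hash maps ----------------------
class RowMajorSparse:
    """Array of hash maps: rows[i] maps feature index -> non-zero value."""

    def __init__(self, n_rows):
        self.rows = [{} for _ in range(n_rows)]

    def set(self, i, j, value):
        # O(1) expected; store non-zeros only
        if value != 0:
            self.rows[i][j] = value

    def get(self, i, j):
        # O(1) expected; missing entries are implicitly 0
        return self.rows[i].get(j, 0)

    def save(self, path, labels):
        # one line per document: label, then index:value pairs
        with open(path, "w") as f:
            for label, row in zip(labels, self.rows):
                pairs = " ".join(f"{j}:{v}" for j, v in sorted(row.items()))
                f.write(f"{label} {pairs}\n")
------------------------------------------------------------------------------------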
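Finally, the end-to-end sketch of the two steps referenced above. It assumes documents are already tokenized into lists of tokens (the real run pulls term vectors from the index), uses slop 0 (contiguous ngrams only), and uses raw ngram counts as matching scores. All function names are illustrative, not the Pyramid API.

---------------------- sketch: gather ngrams, then matching scores ----------------------
from collections import Counter

def extract_ngrams(tokens, n):
    # slop 0: contiguous token sequences only
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def gather_ngrams(train_docs, orders=(1, 2), min_df=0):
    # step 1: gather ngrams ONCE, from the training set only, in a fixed
    # (sorted) order, so train and test columns line up
    df = Counter()                               # document frequency per ngram
    for tokens in train_docs:
        for n in orders:
            df.update(set(extract_ngrams(tokens, n)))
    kept = sorted(g for g, c in df.items() if c >= min_df)
    return {g: i for i, g in enumerate(kept)}    # ngram -> column index

def matching_scores(tokens, ngram_index, orders=(1, 2)):
    # step 2, one sparse row: column index -> count of that ngram
    row = Counter()
    for n in orders:
        for g in extract_ngrams(tokens, n):
            if g in ngram_index:
                row[ngram_index[g]] += 1
    return row

def write_sparse(path, docs, labels, ngram_index):
    # on-disk format described above: label, then index:value pairs,
    # non-zeros only, indices starting at 0
    with open(path, "w") as f:
        for tokens, label in zip(docs, labels):
            row = matching_scores(tokens, ngram_index)
            pairs = " ".join(f"{j}:{v}" for j, v in sorted(row.items()))
            f.write(f"{label} {pairs}\n")

# usage on two of the toy training documents (labels per the LabelTranslator)
train_docs = ["i highly recommend this movie".split(),
              "it is a waste of time".split()]
ngram_index = gather_ngrams(train_docs)
write_sparse("train.txt", train_docs, [1, 0], ngram_index)
------------------------------------------------------------------------------------------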