Goal: extract all unigrams from Elasticsearch and dump them into local files.

Step 0: first look at the dataset.
Dataset: TREC 2007 Spam Track corpus (trec07p), a dataset of email messages. Total number of email messages = 75,419.

There are two parts to this dataset:
1) The label index file ./full/index: each line gives a label and the path to a message. For example, the line for the first email reads "spam ../data/inmail.1", meaning it is spam and the original email message is stored in the ./data/inmail.1 file.

2) The email content itself, e.g. the raw message stored in ./data/inmail.1.


Step 1: Index all trec07p files into Elasticsearch.

Each document includes: label (spam or ham), body (the email contents), and split (train or test, with a 4:1 train/test split).
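
For example, a rough indexing sketch using the elasticsearch-py client (the index name "trec07p", the localhost URL, and the random 4:1 split below are illustrative assumptions, not fixed choices):

import os
import random
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

with open("./full/index") as f:
    for i, line in enumerate(f):
        label, path = line.split()          # e.g. "spam ../data/inmail.1"
        # paths in the index file are relative to the ./full/ directory
        msg_path = os.path.normpath(os.path.join("full", path))
        with open(msg_path, errors="ignore") as msg:
            body = msg.read()
        doc = {
            "label": label,                 # "spam" or "ham"
            "body": body,                   # raw email contents
            "split": "train" if random.random() < 0.8 else "test",  # ~4:1
        }
        es.index(index="trec07p", id=i, document=doc)  # use body=doc on 7.x clients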

Next, dump the data from Step 1 into feature matrices: train.txt and test.txt (LIBSVM sparse format, as consumed by LIBLINEAR in Step 7).
Solve 3 main problems (the colored arrows refer to the feature-matrix example):
1) Red arrow: which email in Elasticsearch each line of the feature matrix stands for.
2) Blue arrow: what the label (1 or 0) in the feature matrix stands for (spam or ham).
3) Yellow line: what each sparse feature, e.g. 40:2, stands for (40 is the term's index, 2 is the tf of that term in this email).
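
Putting the three answers together, a made-up train.txt line in LIBSVM sparse format (the indices and counts here are purely illustrative) would look like:

1 3:1 40:2 97:5

i.e. a spam email (label 1) in which term 3 appears once, term 40 twice, and term 97 five times; the line's position in the file ties it back to an Elasticsearch id via the id lists built in Step 3. Note that LIBLINEAR expects the feature indices on each line in ascending order.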


Step 2: get the ids for train and test:
train_list = [id_1, id_3, ...]
test_list = [id_0, ...]
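
A sketch of this step with the scan helper from elasticsearch-py, reusing the es client and the "trec07p" index assumed in Step 1:

from elasticsearch.helpers import scan

def ids_for(split):
    # "split" must be an exact-match (keyword) field for a term query;
    # use "split.keyword" instead if it was dynamically mapped as text
    query = {"query": {"term": {"split": split}}}
    return [hit["_id"] for hit in scan(es, index="trec07p", query=query)]

train_list = ids_for("train")
test_list = ids_for("test")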
Step 3: dump the train and test ids from Elasticsearch into local files, e.g. named train_ids_list.txt and test_ids_list.txt.
In the example, 7015 is a feature-matrix line number minus 1 (i.e. a zero-based row index),
and 0 is the id for inmail.1 in Elasticsearch.
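
Continuing the sketch, writing one Elasticsearch id per line so that the zero-based line number doubles as the feature-matrix row index:

with open("train_ids_list.txt", "w") as f:
    f.write("\n".join(train_list) + "\n")
with open("test_ids_list.txt", "w") as f:
    f.write("\n".join(test_list) + "\n")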

Step 4: dump the labels (spam & ham) from Elasticsearch into the 0 & 1 values in our feature matrix.
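
A small helper for this mapping (spam -> 1, ham -> 0, the convention implied by the blue arrow above; again assuming the es client and index from Step 1):

def numeric_label(doc_id):
    doc = es.get(index="trec07p", id=doc_id)
    return 1 if doc["_source"]["label"] == "spam" else 0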


Step 5: build the feature index list from the training set.
This step maps each term in the training vocabulary to a unique index number; see the sketch below.
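
One way to do this is via the term vectors API, which returns the analyzed unigrams per document. The sketch below (assuming the es client, "trec07p" index, and train_list from the earlier steps) collects them and writes the mapping to feature_list.txt:

feature_index = {}
for doc_id in train_list:
    tv = es.termvectors(index="trec07p", id=doc_id, fields=["body"])
    terms = tv["term_vectors"].get("body", {}).get("terms", {})
    for term in terms:
        if term not in feature_index:
            # LIBLINEAR feature indices start at 1
            feature_index[term] = len(feature_index) + 1

with open("feature_list.txt", "w") as f:
    for term, idx in sorted(feature_index.items(), key=lambda kv: kv[1]):
        f.write(f"{term}\t{idx}\n")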



Step 6: dump train/test into the feature matrices, one index:value pair per feature:
index is based on feature_list.txt;
value is the tf count (you could instead use tf-idf or another score from Elasticsearch).
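
A sketch that writes both matrices, continuing with feature_index, numeric_label, and the id lists from the previous steps (raw tf as the value here; swapping in tf-idf from the same term vectors response would be a small change):

def write_matrix(id_list, out_path):
    with open(out_path, "w") as out:
        for doc_id in id_list:
            tv = es.termvectors(index="trec07p", id=doc_id, fields=["body"])
            terms = tv["term_vectors"].get("body", {}).get("terms", {})
            # keep only terms seen in training; sort because LIBLINEAR
            # expects ascending feature indices on each line
            feats = sorted((feature_index[t], info["term_freq"])
                           for t, info in terms.items() if t in feature_index)
            row = " ".join(f"{i}:{v}" for i, v in feats)
            out.write(f"{numeric_label(doc_id)} {row}\n")

write_matrix(train_list, "train.txt")
write_matrix(test_list, "test.txt")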


Step 7: run LIBLINEAR classification on train.txt and test.txt.
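
For example, via LIBLINEAR's Python bindings (the import path varies by install: with the classic source tree it is liblinearutil, with pip's liblinear-official it is liblinear.liblinearutil; "-s 0" picks L2-regularized logistic regression, one of several available solvers):

from liblinearutil import svm_read_problem, train, predict

y_train, x_train = svm_read_problem("train.txt")
y_test, x_test = svm_read_problem("test.txt")

model = train(y_train, x_train, "-s 0")
p_labels, p_acc, p_vals = predict(y_test, x_test, model)
print("accuracy:", p_acc[0])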