CS6200
Information Retrieval
Homework 8:
Clustering and Topic Models
Objective
In this assignment, you will cluster documents,
detect topics, and represent documents in topic space.
A) Topic Models per Query
For any given query, select a set of documents that is
the union of the top 1000 BM25 docs (for that query) and
the qrel docs (for that query). The set will be a little
over 1000 docs, since there will be considerable overlap;
a sketch of assembling this set appears after the list
below. Then use one of the packages listed below to
perform LDA. Your output must contain
- the topics detected, with each topic represented as a distribution over words (list the top 20-30 words)
- the documents represented as topic distributions. Each document should have its most relevant 10 or so topics listed
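A minimal sketch of assembling the per-query document set, assuming TREC-style qrel and run files from the earlier homeworks; the file names are placeholders.

from collections import defaultdict

def load_qrel_docs(qrel_path):
    # TREC qrel format: <query_id> <unused> <doc_id> <relevance>
    docs = defaultdict(set)
    with open(qrel_path) as f:
        for line in f:
            qid, _, doc_id, rel = line.split()
            docs[qid].add(doc_id)
    return docs

def load_bm25_docs(run_path, k=1000):
    # TREC run format: <query_id> Q0 <doc_id> <rank> <score> <tag>
    docs = defaultdict(list)
    with open(run_path) as f:
        for line in f:
            qid, _, doc_id, rank, score, tag = line.split()
            docs[qid].append(doc_id)
    # assumes the run file lists docs in rank order
    return {qid: set(ids[:k]) for qid, ids in docs.items()}

qrels = load_qrel_docs("qrels.ap89.txt")   # placeholder file name
bm25 = load_bm25_docs("bm25_run.txt")      # placeholder file name
doc_sets = {qid: bm25.get(qid, set()) | qrels.get(qid, set()) for qid in bm25}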
LDA option 1 : vowpal_wabbit (Bishwajeet). Harder to
use, but the best tool in practice for large data.
1. On Ubuntu, vowpal_wabbit may be installed using the
following set of commands:
sudo apt-get install libboost-program-options-dev zlib1g-dev
sudo apt-get install libboost-python-dev
git clone git://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
make
make install # may require root privileges
2. Input Data format for LDA
1. The expected input format is
| <num-1> <num-2> <num-3>
| <num-5> <num-2>
Each line corresponds to a document.
Use your own hashing function to convert each word to a
number, and make sure the word-to-number mapping is
consistent across all lines of the input. In the above
example, num-2 appears in both documents, i.e. the same
word appears in both documents; such words must be
consistently encoded to the same number. A sketch of this
conversion appears below.
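A minimal sketch of the input conversion, where a plain dict plays the role of the hashing function so the same word always maps to the same number; tokenized_docs and the output file name are placeholders.

vocab = {}

def to_vw_line(tokens):
    ids = []
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab) + 1   # assign ids 1, 2, 3, ...
        ids.append(vocab[tok])
    return "| " + " ".join(str(i) for i in ids)

# toy example; in practice this comes from your own tokenization pipeline
tokenized_docs = [["drug", "trade", "cocaine"], ["drug", "seizure"]]

with open("lda_input.vw", "w") as out:
    for tokens in tokenized_docs:
        out.write(to_vw_line(tokens) + "\n")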
3. Running the LDA
The LDA may be invoked with the following command-line arguments:
vw -d <input-file-name> --lda <num-topics> --lda_D <total-num-docs> --passes 10 -c -k --readable_model <model-file>
This will read from <input-file-name> and output the readable (i.e. not binary) model file to <model-file>.
The other command-line options may be found here - https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-line-arguments
4. Making sense of the readable model file
The output format of the file is:
1. The first 10 lines contain metadata. They should be ignored.
2. From the 11th line onwards, each line contains 10 (assuming you wanted 10 topics) floating point numbers. The columns are the topics.
3. The line number tells you which word the line represents (i.e. <num-1>, <num-3>, etc.)
4. Sort each column and then print the words in the following way (see the sketch below):
4.1 Get the top 10 values in column 0. Note their indices.
4.2 Print them by spreading across all the topics. So
Topic-0 Topic-1 Topic-2 ..
col0-0 col0-1 col0-2 ..
4.3 Repeat the above 2 steps for the required number of topics.
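A minimal sketch of parsing the readable model, assuming the format described above and the vocab dict from the input-conversion sketch; the model file name is a placeholder and the exact word-id offset may need adjusting.

import numpy as np

id_to_word = {i: w for w, i in vocab.items()}

with open("model.txt") as f:           # placeholder file name
    rows = f.readlines()[10:]          # skip the 10 metadata lines

# row r holds the per-topic scores for one word; column t is topic t
scores = np.array([[float(x) for x in r.split()] for r in rows])

for t in range(scores.shape[1]):
    top_rows = np.argsort(scores[:, t])[::-1][:20]
    # word ids here are assumed to start at 1; adjust if yours differ
    print("Topic-%d:" % t, " ".join(id_to_word.get(r + 1, "?") for r in top_rows))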
5. General Reference
https://github.com/JohnLangford/vowpal_wabbit
LDA option 2 : python sklearn (Nikhil)
1) Install scikit-learn, i.e. the sklearn package for python.
2) Convert the top set of documents for each query into a document-term matrix. You can use CountVectorizer from the sklearn.feature_extraction.text package.
It is a good idea to filter out stop words, words with very high document frequency, and words occurring only once or twice. All of these can be passed as parameters to CountVectorizer.
3) Train by fitting the sparse matrix obtained above to LatentDirichletAllocation from sklearn.decomposition. This yields N topics, each a distribution over the M feature words. Sort each topic's feature probabilities to get its top K words.
4) Transform the documents with the fitted model to get the distribution of topics over documents. Sort them again and list the top 3 topics per document.
Also fetch the top 3 topics for each query, either by a rank measure over topics or by checking feature occurrences in each document and keeping the 3 topics with the highest scores. A sketch of steps 2-4 follows.
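A minimal sketch of steps 2-4, assuming texts is a list of raw document strings for one query; the parameter values are illustrative, not required.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
X = vectorizer.fit_transform(texts)              # document-term matrix

lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(X)                # docs as topic distributions

# use get_feature_names() instead on older scikit-learn versions
words = vectorizer.get_feature_names_out()
for t, topic in enumerate(lda.components_):      # topics as word scores
    top = topic.argsort()[::-1][:20]
    print("Topic-%d:" % t, " ".join(words[i] for i in top))

for d, dist in enumerate(doc_topics):
    print("doc %d top topics:" % d, dist.argsort()[::-1][:3])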
LDA option 3 : Python LDA (Rashmi) - easiest to use
Python LDA : https://pypi.python.org/pypi/lda
The above link has the documentation for using the
library, assuming the dataset is already available as
lda.datasets.
For custom datasets, we need to follow the steps below
(a sketch follows the list):
1. Import numpy
2. Import lda
3. Convert the top set of documents for each query into a
document-term matrix. This can be carried out using
various python packages, e.g. textmining or nltk.
4. Fit the document-term matrix to LDA.
5. The model provides 2 probability distributions after
fitting:
a. Topic-word probability distribution. Looking these up
against the document-term matrix vocabulary, the top
10-20 words form each topic.
b. Document-topic probability distribution.
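A minimal sketch of steps 1-5, assuming X is a document-term count matrix (docs x vocab) and words is the matching vocabulary list, both built in step 3; 20 topics and 1500 iterations are illustrative values.

import numpy as np
import lda

model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)                        # X must contain integer counts

topic_word = model.topic_word_      # shape: (n_topics, vocab size)
doc_topic = model.doc_topic_        # shape: (n_docs, n_topics)

for t, dist in enumerate(topic_word):
    top = np.argsort(dist)[::-1][:20]
    print("Topic-%d:" % t, " ".join(words[i] for i in top))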
LDA option 4 : Mallet (java)
http://mallet.cs.umass.edu/topics.php
http://programminghistorian.org/lessons/topic-modeling-and-mallet
https://piazza.com/class/io1w9cnmpk23i6?cid=195
B) LDA-topics and clustering
Run LDA on the entire AP89 collection, with about
T=200 topics. Obtain a representation of all documents
in these topics.
Then run a clustering-partition algorithm on the documents
in this topic representation. "Partition" means every
document gets assigned to exactly one cluster, as with
K-means. You are free to use any clustering
library/package (a sketch appears below). Target K=25
clusters. List each cluster with its document IDs. You
should have an ES index set up so we can look up
documents by ID.
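A minimal sketch of the clustering step using scikit-learn's KMeans, assuming doc_topic is the docs-by-200-topics matrix from the LDA run above and doc_ids lists the matching document IDs; both names are placeholders.

from collections import defaultdict
from sklearn.cluster import KMeans

labels = KMeans(n_clusters=25, random_state=0).fit_predict(doc_topic)

clusters = defaultdict(list)
for doc_id, label in zip(doc_ids, labels):
    clusters[label].append(doc_id)

for label in sorted(clusters):
    print("Cluster %d: %s" % (label, " ".join(clusters[label])))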
How to evaluate: There are about 1831 relevant
documents in total. Consider all the pairs of two
relevant documents, that is (1831 choose 2). For each
pair, count a 1 in the appropriate cell of the
following confusion table:
                    | same cluster | different clusters
same query          |              |
different queries   |              |
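A minimal sketch of the pair counting, assuming rel_docs maps each relevant doc ID to its query and doc_cluster maps each doc ID to its assigned cluster; both mappings are placeholders from the earlier steps, and if a document is relevant to multiple queries you should iterate over (query, doc) pairs instead.

from itertools import combinations

table = {(sq, sc): 0 for sq in (True, False) for sc in (True, False)}
for d1, d2 in combinations(rel_docs, 2):
    same_query = rel_docs[d1] == rel_docs[d2]
    same_cluster = doc_cluster[d1] == doc_cluster[d2]
    table[(same_query, same_cluster)] += 1

print("same query, same cluster:          ", table[(True, True)])
print("same query, different clusters:    ", table[(True, False)])
print("different queries, same cluster:   ", table[(False, True)])
print("different queries, diff. clusters: ", table[(False, False)])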
Rubric : Optional, but can make up 50 points lost on HW1-7
- 7 points - Set up the data per query
- 10 points - Correctly set up and run LDA
- 10 points - Get the output, both docs->topics and
topics->words probabilities/scores
- 8 points - Analyze the result by manual inspection.
Does the docs->topics representation make sense for the
relevant documents? Only need to check 5 docs/query for
a few queries.
- 15 points - part B