CS6220 Unsupervised Data Mining
HW5 Topic Models, Summarization, Elastic Search
Make sure you check the syllabus for the due date. Please
use the notations adopted in class, even if the problem is stated
in the book using a different notation.
We are not looking for very long answers (if you find yourself
writing more than one or two pages of typed text per problem, you
are probably on the wrong track). Try to be concise; also keep in
mind that good ideas and explanations matter more than exact
details.
Submit all code files to Dropbox (create a folder named HW5 or similar). Results can be pdf or txt files, including plots/tables if any.
"Paper" exercises: submit using Dropbox as pdf, either typed or
scanned handwritten.
DATASET: 20 NewsGroups: news articles
DATASET: DUC 2001 summarization dataset
https://www-nlpir.nist.gov/projects/duc/guidelines/2001.html
(can be found in "DM resources")
Elastic Search
You will need to install Elasticsearch and the corresponding plugins in order to manipulate and visualize text.
server: https://www.elastic.co
visualization, control: https://www.elastic.co/downloads/kibana
API for Java: https://www.elastic.co/guide/en/elasticsearch/client/java-api/6.2/index.html
API for Python: https://elasticsearch-py.readthedocs.io/en/master/
API for Perl: http://search.cpan.org/dist/Search-Elasticsearch/lib/Search/Elasticsearch.pm
PROBLEM 1: Text Indexing
Index each dataset separately in Elasticsearch (one index per dataset). First set up the indexes/types/fields in Kibana, then use an API to send all docs to the index. At a minimum you will need two fields, "doc_id" and "doc_text"; you can add other fields. For the DUC dataset, add a field "gold_summary".
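A minimal indexing sketch in Python, assuming the elasticsearch-py client and a local server on the default port; the index name "20ng" and the load_docs() helper are placeholders, not part of the assignment:

    # Minimal bulk-indexing sketch using elasticsearch-py.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()  # defaults to localhost:9200

    def actions(docs, index_name):
        # docs: iterable of (doc_id, doc_text) pairs
        for doc_id, doc_text in docs:
            yield {
                "_index": index_name,
                "_id": doc_id,
                # on a 6.x cluster a "_type" entry (e.g. "doc") may also be needed
                "_source": {"doc_id": doc_id, "doc_text": doc_text},
            }

    docs = load_docs()  # hypothetical loader returning (doc_id, doc_text) pairs
    helpers.bulk(es, actions(docs, "20ng"))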
PROBLEM 2: Topic Models
Obtain topic models (K=10, 20, 50) for both datasets by running the LDA and NMF methods; you can call libraries for both methods and don't have to use the ES index as the source. For both LDA and NMF, print out the top 20 words (with probabilities) for each topic.
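A sketch of the topic-model runs using scikit-learn (libraries are allowed here); the vectorizer settings and the load_texts() helper are illustrative assumptions:

    # Topic-model sketch with scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation, NMF

    texts = load_texts()  # hypothetical loader returning raw document strings

    # LDA is usually fit on raw counts, NMF on tf-idf.
    count_vec = CountVectorizer(max_df=0.95, min_df=2, stop_words="english")
    counts = count_vec.fit_transform(texts)
    tfidf_vec = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")
    tfidf = tfidf_vec.fit_transform(texts)

    for k in (10, 20, 50):
        models = [("LDA", LatentDirichletAllocation(n_components=k, random_state=0).fit(counts), count_vec),
                  ("NMF", NMF(n_components=k, random_state=0).fit(tfidf), tfidf_vec)]
        for name, model, vec in models:
            vocab = vec.get_feature_names_out()  # get_feature_names() on older sklearn
            for t, weights in enumerate(model.components_):
                top = weights.argsort()[::-1][:20]
                probs = weights[top] / weights.sum()  # normalize to (pseudo-)probabilities
                print(name, "k=%d" % k, "topic %d:" % t, list(zip(vocab[top], probs.round(4))))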
The rest of the topic exercises and results are required only for the LDA topics:
- 20NG: how well do the topics align with the 20NG label classes? This is not asking for a measurement, but rather for a visual inspection of which topics match well with which classes. Does this change if one increases the number of topics from 20 to 50?
- ES: Add a type or new index "topic" with fields "topic_id" and "top_words" to store, for each topic, the top 10 words with their associated probabilities (see the sketch after this list).
- ES: Add a document field "doc_topics" and update the index to store, for each document, the most important topics (up to 5) and their doc-topic probabilities.
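A sketch of the two ES updates above, assuming elasticsearch-py 6/7-style calls (body=...); the index names and the topics / doc_topic_rows variables are placeholders standing in for the Problem 2 output:

    # Sketch: store topics and per-document topic mixtures in ES.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # One document per topic: topic_id plus its top-10 (word, probability) pairs.
    for topic_id, top_words in enumerate(topics):       # from the LDA run above
        es.index(index="topics", id=topic_id,
                 body={"topic_id": topic_id, "top_words": top_words})

    # Partial-update each document with its strongest topics (up to 5).
    for doc_id, theta in doc_topic_rows:                # theta: doc-topic probabilities
        top5 = sorted(enumerate(theta), key=lambda p: -p[1])[:5]
        es.update(index="20ng", id=doc_id,
                  body={"doc": {"doc_topics": [{"topic": t, "prob": float(p)}
                                               for t, p in top5]}})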
PROBLEM 3: Extractive Summarization
Implement the KL-Sum summarization method for each dataset. Follow the ideas in this paper; you are allowed to use libraries for text cleaning, segmentation into sentences, etc. Run it twice:
A) KL_summary, based on a word-count PD: PD is a distribution over words proportional to the word counts in the document.
B) LDA_summary, based on a PD built from the LDA topics obtained in Problem 2. The only difference is that PD, while still a distribution over words, is computed using topic modeling (see the sketch below).
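A sketch of the greedy KL-Sum loop under the usual reading of the method: repeatedly add the sentence whose inclusion minimizes KL(PD || PS), where PS is the word distribution of the current summary. The tokenization and the 100-word budget are illustrative; for run B, pass in the LDA-derived PD (pd(w) = sum over topics t of P(t|d) P(w|t)) instead of raw counts:

    # Greedy KL-Sum sketch (pure Python).
    import math
    from collections import Counter

    def word_dist(tokens):
        # empirical unigram distribution of a token list
        c = Counter(tokens)
        n = sum(c.values())
        return {w: f / n for w, f in c.items()}

    def kl(p, q, eps=1e-9):
        # KL(p || q), smoothing words that are missing from q
        return sum(pw * math.log(pw / q.get(w, eps)) for w, pw in p.items())

    def kl_sum(sentences, pd, max_words=100):
        # sentences: list of token lists; pd: target word distribution for the doc
        summary, pool = [], list(sentences)
        while pool and sum(len(s) for s in summary) < max_words:
            flat = [w for s in summary for w in s]
            best = min(pool, key=lambda s: kl(pd, word_dist(flat + s)))
            summary.append(best)
            pool.remove(best)
        return summary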
- ES: Add two new fields to the document type, "KL_summary" and "LDA_summary", to store the obtained summaries.
For the DUC dataset, evaluate the KL_summaries and LDA_summaries against the human gold summaries with ROUGE (the ROUGE Perl package).
EXTRA CREDIT. KL Summarization: Can we make both PD and PS distributions over topics, instead of distributions over words? Would that help?
PROBLEM 4: Simple Sampling
You are not allowed to use sampling libraries/functions, but you can use a rand() call to generate a pseudo-uniform value in [0,1]; you can also use a library that computes pdf(x|params). Make sure to first recap Rejection Sampling and Inverse Transform Sampling.
A. Implement simple sampling from continuous distributions: uniform (min, max, sample_size) and Gaussian (mu, sigma, sample_size); see the sketch after this list.
B. Implement sampling from a 2-dimensional Gaussian distribution (2-d mu, 2-d sigma, sample_size).
C. Implement without-replacement sampling from a discrete non-uniform distribution (given as input), following the Stevens method described in class (paper). Test it on sample sizes N significantly smaller than the population size M (for example N=20, M=300).
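A sketch for parts A and B under the stated constraint (only a uniform rand() equivalent, here random.random(), and no sampling libraries): the Box-Muller transform is one standard way to turn uniform draws into Gaussian draws, and a hand-rolled 2x2 Cholesky factor handles the 2-D case. Part C should follow the Stevens paper and is not sketched here:

    import math, random

    def sample_uniform(lo, hi, n):
        # part A: uniform(min, max) by rescaling random.random()
        return [lo + (hi - lo) * random.random() for _ in range(n)]

    def sample_gauss(mu, sigma, n):
        # part A: N(mu, sigma^2) via the Box-Muller transform (two draws per pair)
        out = []
        while len(out) < n:
            u1, u2 = random.random(), random.random()
            r = math.sqrt(-2.0 * math.log(u1 or 1e-12))  # guard against log(0)
            out.append(mu + sigma * r * math.cos(2 * math.pi * u2))
            if len(out) < n:
                out.append(mu + sigma * r * math.sin(2 * math.pi * u2))
        return out

    def sample_gauss_2d(mu, cov, n):
        # part B: x = mu + L z, with L the hand-rolled 2x2 Cholesky factor of cov
        a = math.sqrt(cov[0][0])
        b = cov[0][1] / a
        c = math.sqrt(cov[1][1] - b * b)
        z = zip(sample_gauss(0.0, 1.0, n), sample_gauss(0.0, 1.0, n))
        return [(mu[0] + a * z1, mu[1] + b * z1 + c * z2) for z1, z2 in z]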
PROBLEM 5: Conditional Sampling
Implement Gibbs Sampling for a multidimensional Gaussian generative joint, using the conditionals, which are also Gaussian distributions. The minimum requirement is for the joint to have D=2 variables and for Gibbs to alternate between the two.
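A sketch of the D=2 case: both conditionals of a bivariate Gaussian are one-dimensional Gaussians, so the sampler just alternates between them. The parameterization (means mu, standard deviations sigma, correlation rho) and the iteration counts are illustrative assumptions:

    import math, random

    def randn():
        # one N(0,1) draw via Box-Muller, using only random.random()
        u1, u2 = random.random(), random.random()
        return math.sqrt(-2.0 * math.log(u1 or 1e-12)) * math.cos(2 * math.pi * u2)

    def gibbs_2d(mu, sigma, rho, n_iter=5000, burn_in=500):
        x1, x2 = mu  # arbitrary starting point
        cond_sd1 = sigma[0] * math.sqrt(1 - rho ** 2)
        cond_sd2 = sigma[1] * math.sqrt(1 - rho ** 2)
        samples = []
        for i in range(n_iter):
            # x1 | x2 ~ N(mu1 + rho*(s1/s2)*(x2 - mu2), s1^2*(1 - rho^2))
            x1 = mu[0] + rho * (sigma[0] / sigma[1]) * (x2 - mu[1]) + cond_sd1 * randn()
            # x2 | x1 is symmetric
            x2 = mu[1] + rho * (sigma[1] / sigma[0]) * (x1 - mu[0]) + cond_sd2 * randn()
            if i >= burn_in:
                samples.append((x1, x2))
        return samples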
Extra Credit: Implement your own LDA using Gibbs Sampling, following this paper.