CS6220/DS5230 Unsupervised Data Mining
HW1 Data Features, Similarity, KNN
Make sure you check the syllabus for the due date. Please
use the notations adopted in class, even if the problem is stated
in the book using a different notation.
We are not looking for very long answers (if you find yourself
writing more than one or two pages of typed text per problem, you
are probably on the wrong track). Try to be concise; also keep in
mind that good ideas and explanations matter more than exact
details.
DATASET : Kosarak : click-stream data of a Hungarian online news portal
(http://fimi.uantwerpen.be/data/kosarak.dat)
DATASET : Aminer : public citation dataset
(https://lfs.aminer.cn/lab-datasets/citation/acm.v9.zip)
DATASET : 20 NewsGroups : news articles
( http://qwone.com/~jason/20Newsgroups/)
DATASET : MNIST : digit images
(http://yann.lecun.com/exdb/mnist/)
PROBLEM 1: Aminer : basic dataset analysis
This is a large dataset (about 2 million publications – it takes about a minute just to parse!). While your notebook must successfully work on the entire dataset, you may find it useful to work on a subset while getting your code to work.
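For reference, here is a minimal parsing sketch in Python. It assumes the plain-text record markers used by the Aminer/ACM citation dump (#* title, #@ authors, #t year, #c venue, #index paper id, #% reference id), with records separated by blank lines; the file name acm.txt is a placeholder, and you should adjust the markers if your copy of the dataset differs.

```python
# Minimal sketch of a parser for the acm.v9 text format (assumed markers:
# #* title, #@ authors, #t year, #c venue, #index paper id, #% reference id).
def parse_aminer(path):
    papers, rec = [], {}
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#*"):
                rec["title"] = line[2:].strip()
            elif line.startswith("#@"):
                rec["authors"] = [a.strip() for a in line[2:].split(";") if a.strip()]
            elif line.startswith("#t"):
                rec["year"] = line[2:].strip()
            elif line.startswith("#c"):
                rec["venue"] = line[2:].strip()
            elif line.startswith("#index"):
                rec["id"] = line[6:].strip()
            elif line.startswith("#%"):
                ref = line[2:].strip()
                if ref:
                    rec.setdefault("refs", []).append(ref)
            elif line == "" and rec:       # a blank line ends the current record
                papers.append(rec)
                rec = {}
    if rec:                                # keep the last record if there is no trailing blank line
        papers.append(rec)
    return papers

papers = parse_aminer("acm.txt")           # "acm.txt" is a placeholder path
print(len(papers), "publications parsed")
```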
- A. Compute the number of distinct authors, publication venues, publications, and citations/references
- B. Are these numbers likely to be accurate? As an example, look up all the publication venue names associated with the conference “Principles and Practice of Knowledge Discovery in Databases” – what do you notice?
- C. For each author, construct the list of publications. Plot a histogram of the number of publications per author (use a logarithmic scale on the y axis)
- D. Calculate the mean and standard deviation of the number of publications per author. Also calculate the Q1 (1st quartile), Q2 (2nd quartile, or median), and Q3 (3rd quartile) values (one way to compute these is sketched after this list). Compare the median to the mean and explain the difference between the two values based on the standard deviation and the 1st and 3rd quartiles.
- E. Now plot a histogram of the number of publications per venue, as well as calculate the mean, standard deviation, median, Q1, and Q3 values. What is the venue with the largest number of publications in the dataset?
- F. Plot a histogram of the number of references (number of publications a publication refers to) and citations (number of publications referring to a publication) per publication. What is the publication with the largest number of references? What is the publication with the largest number of citations? Do these make sense?
- G. Calculate the so-called “impact” factor for each venue. To do so, calculate the total number of citations for the publications in the venue, and then divide this number by the number of publications for the venue. Plot a histogram of the results.
- H. What is the venue with the highest apparent impact factor? Do you believe this number? (http://mdanderson.libanswers.com/faq/26159)
- I. Now repeat the calculation from item G, but restrict the calculation to venues with at least 10 publications. How does your histogram change? List the citation counts for all publications from the venue with the highest impact factor. How does the impact factor (mean number of citations) compare to the median number of citations?
- J. Finally, construct a list of publications for each publication year. Use this list to plot the average number of references and average number of citations per publication as a function of time. Explain the differences you see in the trends.
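Items C and D can be computed along the lines of the sketch below, which assumes a papers list like the one produced by the parsing sketch above; the same pattern extends to venues (item E) and to references/citations (item F).

```python
# Sketch for items C-D: publications per author, log-scale histogram,
# mean/std and quartiles. Assumes `papers` from the parsing sketch above.
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter

pubs_per_author = Counter()
for p in papers:
    for a in p.get("authors", []):
        pubs_per_author[a] += 1

counts = np.array(list(pubs_per_author.values()))

plt.hist(counts, bins=100)
plt.yscale("log")                          # logarithmic scale on the y axis
plt.xlabel("publications per author")
plt.ylabel("number of authors")
plt.show()

print("mean:", counts.mean(), "std:", counts.std())
print("Q1, median, Q3:", np.percentile(counts, [25, 50, 75]))
```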
Useful links:
https://en.wikipedia.org/wiki/IPython#Notebook
https://aminer.org
https://en.wikipedia.org/wiki/ECML_PKDD
https://en.wikipedia.org/wiki/Quartile
PROBLEM 2 : Kosarak Association Rules
Your task is to take a dataset of nearly one million clicks on a news site and use the Weka Explorer to identify interesting association rules. Ordinarily this would be a point-and-click task; however, the input data format is a list of transactions (each line in the file includes a list of anonymized news item IDs), whereas Weka requires a tabular format. Specifically, each distinct news item ID should be represented via a column/attribute, and each row/instance should be a sequence of binary values, indicating whether or not the user visited the corresponding news item.
- A. Write a Python program which takes as its argument the path to a text file of data (assumed to be in the itemset format above) and produces as output to the console a sparse ARFF file (a sketch of one possible converter appears after this list).
- B. Use your program to convert the kosarak.dat file to a sparse kosarak.arff. About how long did it take to run?
- C. Load the resulting file into Weka (as described above; you should have 41,270 attributes and 990,002 instances). About how long did it take to load this file?
- D. Use Weka’s FP-Growth implementation to find rules that have support count of at least 49,500 and confidence of at least 99% – record your rules (there should be 2).
- E. Run the algorithm at least 5 times. Then look at the log and record how much time each run took. How does the average run time compare to the time necessary to convert the dataset and then load it into Weka?
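For item A, one possible converter is sketched below. The relation name and the item{N} attribute names are placeholders; the only firm requirement is Weka's sparse ARFF syntax, in which each instance lists only its non-zero attributes as "index value" pairs with indices in increasing order.

```python
# Sketch of a kosarak.dat -> sparse ARFF converter: reads whitespace-separated
# item ids, one transaction per line, and writes a sparse ARFF to stdout.
import sys

def main(path):
    with open(path) as f:
        transactions = [sorted({int(tok) for tok in line.split()})
                        for line in f if line.strip()]

    # Map each distinct item id to a 0-based attribute index.
    items = sorted({i for t in transactions for i in t})
    index = {item: col for col, item in enumerate(items)}

    print("@RELATION kosarak")
    for item in items:
        print(f"@ATTRIBUTE item{item} {{0,1}}")
    print("@DATA")
    for t in transactions:
        # Sparse instances list only the non-zero attributes, indices ascending.
        cols = sorted(index[i] for i in t)
        print("{" + ",".join(f"{c} 1" for c in cols) + "}")

if __name__ == "__main__":
    main(sys.argv[1])
```

A typical invocation (the script name is hypothetical) would be python kosarak_to_arff.py kosarak.dat > kosarak.arff, redirecting the console output to the .arff file.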
PROBLEM 3: MNIST, 20 NG : Preprocessing
Your goal in this problem is to *parse*, *normalize*, and otherwise *prepare* two common data sets (MNIST + 20NG) for classification. In this problem, that includes prepping the datasets for **[dis]similarity** computations.
Your first task is parsing. As this is the first assignment and the parsers are very different for the two datasets (images vs. text), you may use any library/package to aid in the parsing here; however, you are encouraged to write your own.
Your second task is normalization. The type of normalization used depends on the task and dataset. Common types of normalization include:
Shift-and-scale normalization: subtract the minimum, then divide by the new maximum, so that all values fall between 0 and 1
Zero mean, unit variance: subtract the mean, then divide by the standard deviation so that the variance is 1
Term-Frequency (TF) weighting: map each term in a document to its frequency (text only; see the wiki page)
It is up to you to determine the appropriate normalization.
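As a reference point, here is a minimal sketch of the first two options for a dense feature matrix X (rows = examples, columns = features), normalizing each column across all datapoints as the tips below suggest.

```python
# Minimal sketch of shift-and-scale and zero-mean/unit-variance normalization
# for a dense matrix X with examples in rows and features in columns.
import numpy as np

def shift_and_scale(X):
    """Map each column to [0, 1]: subtract its minimum, divide by the new maximum."""
    X = np.asarray(X, dtype=float)
    X = X - X.min(axis=0)
    rng = X.max(axis=0)
    rng[rng == 0] = 1.0          # constant columns would otherwise divide by zero
    return X / rng

def zero_mean_unit_variance(X):
    """Subtract each column's mean and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std
```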
Your final task is to compute several types of pairwise similarities, for use in the next question. You are encouraged to write your own implementation of the pairwise similarity/distance matrix computation---but unless explicitly specified we will accept any code using a common library available in Matlab/Java/Python/R.
Distance/similarity options to implement:
Euclidean distance (required, library)
Euclidean distance (required, your own implementation - use batches if you run into memory issues; see the sketch after this list)
edit distance (required for text) -or- cosine similarity (required for vectors)
Jaccard similarity (optional)
Manhattan distance (optional)
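The sketch below covers the "your own" Euclidean distance option, computed in batches so the intermediate arrays stay small; it uses the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x·y. For the library version, scipy.spatial.distance.cdist or sklearn.metrics.pairwise_distances can serve as a comparison.

```python
# Sketch of a batched pairwise Euclidean distance between rows of X and Y,
# using ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y to avoid huge intermediates.
import numpy as np

def euclidean_distances_batched(X, Y, batch_size=1000):
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    y_sq = (Y ** 2).sum(axis=1)                       # ||y||^2 for every row of Y
    D = np.empty((X.shape[0], Y.shape[0]))
    for start in range(0, X.shape[0], batch_size):
        xb = X[start:start + batch_size]
        x_sq = (xb ** 2).sum(axis=1, keepdims=True)   # ||x||^2 for this batch
        sq = x_sq + y_sq - 2.0 * xb @ Y.T
        np.maximum(sq, 0.0, out=sq)                   # clip tiny negatives from round-off
        D[start:start + batch_size] = np.sqrt(sq)
    return D
```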
Some tips / hints:
For 20NG, try TF normalization on e.g. the rows. Note that for text it is critical to maintain a sparse format due to the large number of columns.
Make sure any value transformation retains the 0 values.
As MNIST consists of pixel images, the values often come 'normalized' to a fixed range [0-255]; however, there are advantages to having zero mean.
Do not normalize labels.
When normalizing a column, make sure to normalize its values across all datapoints (train, test, validation, etc.) for consistency.
Depending on what similarity/distance measure you use, computation of similarity might be easy but the size of the similarity matrix might present a challenge.
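On the text side, a sketch along the following lines keeps everything sparse. It assumes 20NG has already been vectorized into a scipy.sparse matrix of raw term counts (the variable name counts is a placeholder).

```python
# Sketch of TF row-normalization and cosine similarity on a sparse
# document-term matrix; zeros stay zero, so the sparse format is preserved.
import numpy as np
import scipy.sparse as sp

def tf_normalize(counts):
    """Divide each row by its total term count."""
    counts = sp.csr_matrix(counts, dtype=float)
    row_sums = np.asarray(counts.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0
    return sp.diags(1.0 / row_sums) @ counts

def cosine_similarity_sparse(X):
    """Cosine similarity = dot products of L2-normalized rows."""
    X = sp.csr_matrix(X, dtype=float)
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0
    Xn = sp.diags(1.0 / norms) @ X
    return Xn @ Xn.T             # densify only if the full matrix fits in memory
```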
Useful links include:
https://en.wikipedia.org/wiki/Feature_scaling
https://en.wikipedia.org/wiki/Distance_matrix
https://en.wikipedia.org/wiki/Distance
https://en.wikipedia.org/wiki/Category:Similarity_and_distance_measures
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.154.8446&rep=rep1&type=pdf
http://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
PROBLEM 4: MNIST, 20 NG : Train and test KNN classification (supervised)
Your goal in this problem is to write your own K-nearest neighbor (KNN) classifier.
For each of the two datasets, now in matrix format and with pairwise similarity computed, train and test a KNN classifier. You are required to implement the KNN classification model yourself, though you may use support libraries / data structures for the neighbor searching.
You should partition each dataset into (say) an 80/10/10 training/testing/validation split. Note that the actual "training" here consists of simply identifying nearest neighbors---unlike other common classifiers, there is no iterative or gradient-based procedure.
Report both training performance and testing performance. If using Python, you are encouraged (but not required) to write a scikit-learn compatible *estimator* class supporting a common API interface, e.g. *.fit(), *.predict(), *.transform(), etc. See https://scikit-learn.org/stable/developers/develop.html for more details.
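One possible shape for such an estimator is sketched below; the class name and parameters are placeholders, and distance_fn can be any pairwise distance routine, for example the batched Euclidean sketch from Problem 3.

```python
# Sketch of a KNN classifier with a scikit-learn style fit/predict interface.
# Prediction is a majority vote over the k nearest training points.
import numpy as np

class SimpleKNN:
    def __init__(self, k=5, distance_fn=None):
        self.k = k
        self.distance_fn = distance_fn       # e.g. euclidean_distances_batched

    def fit(self, X, y):
        # "Training" is just storing the data: KNN is a lazy learner.
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)
        return self

    def predict(self, X):
        D = self.distance_fn(np.asarray(X), self.X_train)   # (n_test, n_train)
        nearest = np.argsort(D, axis=1)[:, :self.k]          # k nearest neighbor indices
        preds = []
        for row in nearest:
            labels, votes = np.unique(self.y_train[row], return_counts=True)
            preds.append(labels[np.argmax(votes)])
        return np.array(preds)

# Hypothetical usage:
# knn = SimpleKNN(k=5, distance_fn=euclidean_distances_batched).fit(X_train, y_train)
# accuracy = (knn.predict(X_test) == y_test).mean()
```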