CS6220 Unsupervised Data Mining

HW6 Social Graphs, Recommendation Systems

Make sure you check the syllabus for the due date. Please use the notations adopted in class, even if the problem is stated in the book using a different notation.

We are not looking for very long answers (if you find yourself writing more than one or two pages of typed text per problem, you are probably on the wrong track). Try to be concise; also keep in mind that good ideas and explanations matter more than exact details.

Submit all code files to Dropbox (create a folder named HW6 or similar). Results can be pdf or txt files, including plots/tables if any.

"Paper" exercises: submit using Dropbox as pdf, either typed or scanned handwritten.


DATASET MovieLens 100K Ratings https://grouplens.org/datasets/movielens/100k/

DATASET Netflix Prize Ratings Dataset https://www.kaggle.com/netflix-inc/netflix-prize-data

DATASET Friendster Social Graph http://socialcomputing.asu.edu/datasets/Friendster

DATASET Flickr Social Graph http://networkrepository.com/soc-Flickr-ASU.php, but use the one curated in DM resources



PROBLEM 1: Recommender System using Collaborative Filtering

Implement a movie recommendation system and run it on the MovieLens dataset (train vs. test split). Measure performance on the test set using RMSE.
  • First, compute a user-user similarity based on ratings and movies in common.
  • Second, make rating predictions on the test set following the KNN idea: the prediction for a (user, movie) pair is the weighted average of other users' ratings for that movie, weighted by each user's similarity to the given user (a sketch follows this list).
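Below is a minimal sketch of one way to set this up, assuming the ua.base/ua.test train/test split shipped with the 100K archive and cosine as the user-user similarity (Pearson correlation over co-rated movies is another common choice); the file paths and the neighborhood size k are placeholders to adjust.

```python
import numpy as np
import pandas as pd

# Assumed file names from the MovieLens 100K archive; adjust paths as needed.
cols = ["user", "item", "rating", "ts"]
train = pd.read_csv("ua.base", sep="\t", names=cols)
test = pd.read_csv("ua.test", sep="\t", names=cols)

# users x movies rating matrix, 0 where unrated
R = train.pivot_table(index="user", columns="item", values="rating").fillna(0)
M = R.to_numpy()
global_mean = M[M > 0].mean()  # fallback prediction

# cosine user-user similarity over the full rating vectors
norms = np.linalg.norm(M, axis=1, keepdims=True)
S = (M @ M.T) / (norms @ norms.T + 1e-12)
np.fill_diagonal(S, 0)  # a user should not vote for themselves

def predict(u, i, k=30):
    """Similarity-weighted average over the k most similar users who rated i."""
    if i not in R.columns:               # movie unseen in training
        return global_mean
    col = R.columns.get_loc(i)
    raters = np.where(M[:, col] > 0)[0]
    if len(raters) == 0:
        return global_mean
    uidx = R.index.get_loc(u)            # assumes every test user appears in train
    top = raters[np.argsort(-S[uidx, raters])[:k]]
    w = S[uidx, top]
    return np.dot(w, M[top, col]) / w.sum() if w.sum() > 0 else global_mean

preds = np.array([predict(u, i) for u, i in zip(test.user, test.item)])
print("test RMSE:", np.sqrt(np.mean((preds - test.rating.to_numpy()) ** 2)))
```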




PROBLEM 2: Netflix Recommendations (extra credit)

Implement a recommender system on the Netflix Prize dataset. Use the "probe" set for testing. For competitive systems one needs to add movie content features such as actors, genres, directors, music, etc. These features are not available from Netflix, but for some movies they have been crawled by movie title from other websites such as IMDB (see, for example, the "movie_details.xml" file in DM_resources; you can gather more such features on your own).
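As a starting point, here is a minimal sketch of a probe-set parser, assuming the file layout of the Kaggle release (a "MovieID:" header line followed by one customer id per line); verify the format against your copy of probe.txt.

```python
from collections import defaultdict

def load_probe(path="probe.txt"):
    """Parse probe.txt into {movie_id: [customer_ids]} (assumed Kaggle layout)."""
    probe = defaultdict(list)
    movie = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.endswith(":"):       # "MovieID:" header line
                movie = int(line[:-1])
            elif line:                   # one customer id per line
                probe[movie].append(int(line))
    return probe
```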



PROBLEM 3A: Social Community Detection

Implement the edge-removal community detection algorithm on the Flickr graph, using the betweenness idea on edges and the Girvan–Newman algorithm. The original dataset graph has more than 5M edges; DM_resources contains four sub-sampled graphs with edge counts from 2K to 600K, which you can use if the original is too big.
You should use a library to support graph operations (edges, vertices, paths, degrees, etc.). We used igraph in Python, which also has built-in community detection algorithms (not allowed for your solution); these are useful only as a way to evaluate the communities you obtain. A sketch of the removal loop is given below.
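A minimal sketch of the Girvan–Newman loop, using igraph only for graph operations (edge betweenness, edge deletion, connected components); the Zachary karate-club graph stands in for the Flickr edge list, and the target community count is a placeholder. Recomputing betweenness after every removal is roughly O(|V||E|) per step, so this is practical for the smaller subsamples only.

```python
import igraph as ig

def girvan_newman(g: ig.Graph, target: int) -> ig.VertexClustering:
    """Repeatedly remove the highest-betweenness edge until the graph
    breaks into `target` connected components."""
    g = g.copy()
    while len(g.connected_components()) < target and g.ecount() > 0:
        eb = g.edge_betweenness()           # recompute after every removal
        g.delete_edges(eb.index(max(eb)))   # drop the top edge
    return g.connected_components()         # g.clusters() on older igraph

g = ig.Graph.Famous("Zachary")  # stand-in; load the Flickr edge list instead
print(girvan_newman(g, 2).sizes())
```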


PROBLEM 3B: Social Community Detection (extra credit)

Implement the spectral modularity detection algorithm on the Flickr graph from the previous problem. You will need to compute the modularity matrix B and the eigenvector V1 corresponding to its largest eigenvalue. The split vector S (entries +1 / -1) aligns by sign with V1; follow this paper. Split the graph repeatedly in two parts until the desired number of components is obtained. A sketch of one bisection step is given below.
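A minimal numpy sketch of a single bisection step under the stated method: B = A - k k^T / 2m, take the eigenvector of the largest eigenvalue, split by sign. A dense adjacency matrix is assumed, which is fine for the smaller subsampled graphs; repeat the step on each half to obtain more communities.

```python
import numpy as np

def spectral_split(A: np.ndarray) -> np.ndarray:
    """One bisection of spectral modularity detection.
    A: symmetric 0/1 adjacency matrix of the (sub)graph to split.
    Returns the split vector S with +1/-1 entries."""
    k = A.sum(axis=1)                  # degree vector
    two_m = k.sum()                    # 2 * number of edges
    B = A - np.outer(k, k) / two_m     # modularity matrix
    vals, vecs = np.linalg.eigh(B)     # eigh: B is symmetric
    v1 = vecs[:, np.argmax(vals)]      # eigenvector of the largest eigenvalue
    return np.where(v1 >= 0, 1, -1)    # S aligns by sign with V1
```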


PROBLEM 4: Knowledge Base Question Answering

Given are a knowledge graph with entities and relations, questions with a starting entity and their correct answers, and word embeddings for them. For each question, navigate the graph outward from the start entity until you find appropriate answer entities.
Utility functions (similarity, load_graphs) are provided in a Python helper file, but you can choose not to use them; the only update needed to use this file is to fill in the file paths.

- The number of correct answers varies (could be 1, could be many); use F1 to compare your answers with the given correct answers (a sketch follows this list).
- Answers are given for evaluation only; DO NOT USE THE ANSWERS IN YOUR GRAPH TRAVERSAL.
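Set-based F1 is the natural choice here, since both your output and the gold answers are sets of entities; a minimal sketch:

```python
def f1_score(predicted, gold):
    """F1 between a predicted answer set and the gold answer set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)          # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```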
Your strategy should be a graph traversal augmented with scoring of paths; you may discard unpromising paths along the way, similar to a focused-crawl strategy. Each query (question) you are trying to answer comes with a starting entity. Begin your traversal at that starting entity and look at all adjacent edges. Use get_rel_score_word2vecbase to get a similarity score for each edge, and traverse the edges that are promising. This part is up to you: you can cut off scores below a certain threshold, take only a top percentage, or weight edges relative to the average score.

There are many valid strategies. You might continue along a path until the score starts to decrease, or until the similarity score drops significantly compared to the previous edges. Overall, try a few different approaches and choose the one that gives you the best overall F1 score. One such scored traversal is sketched below.
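A minimal sketch of one thresholded best-first traversal. The adjacency structure (entity -> list of (relation, neighbor) pairs) and the exact signature of the provided get_rel_score_word2vecbase helper are assumptions; collecting every visited entity as a candidate answer is a simplification you would tune.

```python
import heapq

def answer_question(graph, question, start, threshold=0.5, max_hops=3):
    """Expand edges whose relation scores well against the question; stop
    extending a path when an edge scores below the threshold or below the
    score of the edge that led to the current entity."""
    answers, seen = set(), {start}
    frontier = [(0.0, start, 0)]   # (-score of previous edge, entity, hops)
    while frontier:
        neg_prev, entity, hops = heapq.heappop(frontier)
        if hops >= max_hops:
            continue
        for rel, nbr in graph.get(entity, []):  # assumed (relation, neighbor) pairs
            # provided helper; exact signature assumed
            score = get_rel_score_word2vecbase(question, rel)
            if score < threshold or score < -neg_prev:
                continue                        # path no longer promising
            if nbr not in seen:
                seen.add(nbr)
                answers.add(nbr)                # candidate answer entity
                heapq.heappush(frontier, (-score, nbr, hops + 1))
    return answers
```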