CS 6220: Data Mining Techniques

This course covers various aspects of data mining including data preprocessing, classification, ensemble methods, association rules, sequence mining, and cluster analysis. The class project involves hands-on practice of mining useful knowledge from a large data set.


News

[04/27/2011] Final results of data mining competition posted
[04/13/2011] Slides from April 12 lecture posted
[04/07/2011] Slides from April 5 lecture posted
[04/04/2011] Final HW available on Blackboard
[03/31/2011] Slides from Mar 29 lecture posted
[03/23/2011] Slides from Mar 22 lecture posted
[03/09/2011] Slides from Mar 8 lecture posted
[02/23/2011] Slides from Feb 22 lecture posted
[02/18/2011] HW 3 available on Blackboard
[02/16/2011] Slides from Feb 15 lecture posted


Lectures

Larger version of slides (2 per page)

(Future lectures and events are tentative.)

Date Topic Remarks and Homework
January 11 Introduction; Data Preprocessing Read chapters 1 and 2 in the book.
January 18 Data Preprocessing; Classification and Prediction Read relevant sections in chapter 2.
January 25 Classification and Prediction Read relevant sections in chapter 6.
January 26 HW 1 due at 11pm Submit it through Blackboard.
February 1 No class due to inclement weather. The university canceled all classes after 4pm.
February 2 HW 2 due at 11pm Submit it through Blackboard.
February 8 Classification and Prediction Read relevant sections in chapter 6. For more information, also look at references [1] for trees and [5] for statistical decision theory (see below).
February 15 Classification and Prediction Read relevant sections in chapter 6. For more information about the bias-variance tradeoff, look at Geman92.pdf (uploaded on Blackboard).

Optional HW: Go over the Naive Bayes computation example on slide 110 and make sure you can do this on your own for any given input record.
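The mechanics of that computation can be practiced on any small data set. Below is a minimal sketch of the Naive Bayes scoring rule P(C) * prod_i P(x_i | C) with probabilities estimated from counts; the tiny weather-style data set is made up for illustration and is not the example from slide 110:

```python
from collections import Counter

# Hypothetical training data: (outlook, windy) -> play
train = [
    ("sunny", "no",  "yes"),
    ("sunny", "yes", "no"),
    ("rainy", "no",  "yes"),
    ("rainy", "yes", "no"),
    ("sunny", "no",  "yes"),
]

def naive_bayes_score(record, data):
    """Score each class by P(C) * prod_i P(x_i | C), estimated from counts."""
    classes = Counter(row[-1] for row in data)
    n = len(data)
    scores = {}
    for c, n_c in classes.items():
        rows_c = [row for row in data if row[-1] == c]
        score = n_c / n                       # prior P(C)
        for i, value in enumerate(record):
            match = sum(1 for row in rows_c if row[i] == value)
            score *= match / n_c              # likelihood P(x_i | C)
        scores[c] = score
    return scores

scores = naive_bayes_score(("sunny", "no"), train)
prediction = max(scores, key=scores.get)      # "yes": 0.6 * 2/3 * 1 = 0.4 beats "no": 0
```

Note that the zero count for P(windy=no | play=no) drives that class score to 0; the lecture's Laplace smoothing variant avoids exactly this.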
February 22 Classification and Prediction Read relevant sections in chapter 6. Reference [2] is an excellent source for more information about artificial neural networks.
February 25 HW 3 due at 11pm Submit it through Blackboard.
March 1 No class (Spring Break)  
March 8 Classification and Prediction Read relevant sections in chapter 6.

If you are interested in more technical information about SVMs, take a look at SVMoverview.pdf (uploaded on Blackboard). Depending on your math background, some sections might be difficult to understand in detail, but the general idea will be clear.

There is no homework this week other than studying for the midterm.
March 15 Midterm exam. Same start time and location as class.
March 22 Classification and Prediction; Frequent Patterns Read relevant sections in chapters 6 and 5.

Slides 230-236 and the discussion about Groves represent advanced material for those interested in learning more, but are not required reading for this class. Similarly, the specific formulas for the boosting algorithm are optional reading, but you need to know the basic functionality.
March 29 Frequent Patterns Read relevant sections in chapter 5.

Go over the examples for Apriori and FP-growth (look at the textbook for more details) and make sure you can run the algorithms manually on small examples. Compare how FP-growth explores the itemset lattice differently than Apriori.
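To check a manual run, a bare-bones Apriori (level-wise candidate generation followed by subset pruning and a support scan) can be sketched in a few lines. The transactions and the absolute min_sup below are made up, not the textbook example:

```python
from itertools import combinations

# Hypothetical transaction database; min_sup is an absolute count.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
min_sup = 3

def apriori(db, min_sup):
    """Level-wise search: frequent k-itemsets generate (k+1)-candidates."""
    items = sorted({i for t in db for i in t})
    freq = {}
    # L1: frequent single items
    level = [frozenset([i]) for i in items
             if sum(1 for t in db if i in t) >= min_sup]
    k = 1
    while level:
        for s in level:
            freq[s] = sum(1 for t in db if s <= t)
        # Join frequent k-itemsets into (k+1)-candidates, prune any candidate
        # with an infrequent k-subset, then count support in one scan.
        cands = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = [c for c in cands
                 if all(frozenset(sub) in freq for sub in combinations(c, k))
                 and sum(1 for t in db if c <= t) >= min_sup]
        k += 1
    return freq

result = apriori(transactions, min_sup)
# {A}, {B}, {C} each have support 4; {A,B}, {A,C}, {B,C} have support 3;
# {A,B,C} has support 2 and is pruned.
```

This breadth-first, generate-and-test style is exactly what FP-growth avoids: FP-growth compresses the database into an FP-tree and grows patterns depth-first by suffix, never generating candidates.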
April 5 Frequent Patterns Read relevant sections in chapter 5.

For the example on slide 48 (better: create your own small example), find the maximal and closed frequent itemsets for min_sup=3 and min_sup=1.
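Once all frequent itemsets and their supports are listed, both definitions can be checked mechanically: maximal means no proper superset is frequent, closed means no proper superset has the same support. A minimal sketch on a made-up example (not the slide-48 itemsets), using the frequent itemsets of the hypothetical database {ABC, ABC, AB, A} with min_sup = 2:

```python
# All frequent itemsets of the 4-transaction database {ABC, ABC, AB, A}
# with min_sup = 2 (a made-up example, not slide 48).
freq = {
    frozenset("A"): 4, frozenset("B"): 3, frozenset("C"): 2,
    frozenset("AB"): 3, frozenset("AC"): 2, frozenset("BC"): 2,
    frozenset("ABC"): 2,
}

# Maximal frequent: no proper frequent superset exists.
maximal = {s for s in freq if not any(s < t for t in freq)}

# Closed frequent: no proper superset has the same support.
closed = {s for s in freq
          if not any(s < t and freq[t] == freq[s] for t in freq)}

# maximal is just {ABC}; closed is {A, AB, ABC}, which together with the
# supports lets you reconstruct the support of every frequent itemset.
```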

Practice the computation of lift and discuss why support and confidence might not be good enough in practice.
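Recall that lift(A -> B) = conf(A -> B) / sup(B) = sup(A u B) / (sup(A) * sup(B)) with relative supports. The made-up transactions below show why confidence alone can mislead: the rule coffee -> milk has 75% confidence, yet its lift is below 1, meaning coffee buyers are actually less likely than average to buy milk:

```python
from fractions import Fraction

# 10 hypothetical transactions: milk in 9 of them, coffee in 4,
# both together in 3.
db = [{"coffee", "milk"}] * 3 + [{"coffee"}] + [{"milk"}] * 6
n = len(db)

def sup(itemset):
    """Relative support: fraction of transactions containing the itemset."""
    return Fraction(sum(1 for t in db if itemset <= t), n)

def confidence(a, b):
    return sup(a | b) / sup(a)

def lift(a, b):
    return sup(a | b) / (sup(a) * sup(b))

c = confidence({"coffee"}, {"milk"})   # 3/4 -- looks like a strong rule
l = lift({"coffee"}, {"milk"})         # (3/10) / (4/10 * 9/10) = 5/6 < 1
```

Lift near 1 indicates (approximate) independence; values below 1 indicate negative correlation, which support and confidence by themselves never reveal.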

Explain in which order GSP and PrefixSpan explore possible sequences. How do these algorithms differ in the way they prune the search space? (Hint: Use the tree of sequences as presented in class, which is similar to the itemset lattice and has 1-item sequences in the first level, 2-item sequences in the second, and so on.)

What are the main similarities and differences between Apriori and GSP? What are the main similarities and differences between FP-Growth and PrefixSpan?
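Whatever their traversal order (GSP is breadth-first and level-wise like Apriori; PrefixSpan is depth-first over projected databases like FP-growth), both algorithms rest on the same containment test: a pattern is contained in a data sequence if each of its elements is a subset of a distinct later element, in order. A minimal sketch, with a made-up sequence:

```python
def contains(data_seq, pattern):
    """True if pattern (a list of itemsets) occurs in data_seq, i.e. each
    pattern element is a subset of a distinct data element, in order.
    Greedy earliest matching is sufficient for subsequence containment."""
    j = 0
    for element in data_seq:
        if j < len(pattern) and pattern[j] <= element:
            j += 1
    return j == len(pattern)

# <(a)(bc)> is contained in <(a)(b)(bc)(d)>, but <(bc)(a)> is not.
seq = [{"a"}, {"b"}, {"b", "c"}, {"d"}]
```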
April 12 Clustering Read relevant sections in chapter 7. For additional information, look at reference [1].
April 18 Project pre-submission due at 11pm Submit it through Blackboard.
April 19 Review, Project discussion, Data Warehousing and OLAP overview  
April 22 Project final submission due at 11pm Submit all files through Blackboard.
April 26 Final exam 6-8pm in usual classroom  

Data Mining Competition Ranking

Pre-Submission                Final Accuracy on testset1 (and testset0)
Team   Accuracy               Team   Accuracy
8      79.521                 1      81.242 (81.13)
2      78.927                 2      80.530 (78.93)
3      78.792                 5      80.046 (79.95)
1      77.667                 8      80.033 (80.01)
5      76.676                 6      80.021 (80.01)
7      76.280                 3      79.923 (79.94)
4      74.945                 7      79.741 (79.72)
6      74.512                 4      79.604 (79.57)

Intermediate Results

Team   Latest Accuracy
1      80.313
6      80.006
8      79.521
2      78.927
3      78.792
7      78.162
5      78.143
4      74.945

Course Information

Instructor: Mirek Riedewald

TA: We have no TA this semester :-(

Meeting times: Tue 6 - 9 PM
Meeting location: 108 WVH

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Textbook

Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann, 2006

Recommended books for further reading:

  1. "Introduction to Data Mining" by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar (http://www-users.cs.umn.edu/~kumar/dmbook/index.php)
  2. "Machine Learning" by Tom Mitchell (http://www.cs.cmu.edu/~tom/mlbook.html)
  3. "Introduction to Machine Learning" by Ethem ALPAYDIN (http://www.cmpe.boun.edu.tr/~ethem/i2ml/)
  4. "Pattern Classification" by Richard O. Duda, Peter E. Hart, David G. Stork (http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471056693.html)
  5. "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani, and Jerome Friedman (http://www-stat.stanford.edu/~tibs/ElemStatLearn/)

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.