This course covers various aspects of data mining including data preprocessing, classification, ensemble methods, association rules, sequence mining, and cluster analysis. The class project involves hands-on practice of mining useful knowledge from a large data set.
[04/27/2011] Final results of data mining competition posted
[04/13/2011] Slides from April 12 lecture posted
[04/07/2011] Slides from April 5 lecture posted
[04/04/2011] Final HW available on Blackboard
[03/31/2011] Slides from Mar 29 lecture posted
[03/23/2011] Slides from Mar 22 lecture posted
[03/09/2011] Slides from Mar 8 lecture posted
[02/23/2011] Slides from Feb 22 lecture posted
[02/18/2011] HW 3 available on Blackboard
[02/16/2011] Slides from Feb 15 lecture posted; a larger version of the slides (2 per page) is also available
(Future lectures and events are tentative.)
Date | Topic | Remarks and Homework |
---|---|---|
January 11 | Introduction; Data Preprocessing | Read chapters 1 and 2 in the book. |
January 18 | Data Preprocessing; Classification and Prediction | Read relevant sections in chapter 2. |
January 25 | Classification and Prediction | Read relevant sections in chapter 6. |
January 26 | HW 1 due at 11pm | Submit it through Blackboard. |
February 1 | No class due to inclement weather. | The university canceled all classes after 4pm. |
February 2 | HW 2 due at 11pm | Submit it through Blackboard. |
February 8 | Classification and Prediction | Read relevant sections in chapter 6. For more information, also look at references [1] for trees and [5] for statistical decision theory (see below). |
February 15 | Classification and Prediction | Read relevant sections in chapter 6. For more information about the bias-variance tradeoff, look at Geman92.pdf (uploaded on Blackboard). Optional HW: Go over the Naive Bayes computation example on slide 110 and make sure you can do this on your own for any given input record (a small computational sketch follows the schedule table below). |
February 22 | Classification and Prediction | Read relevant sections in chapter 6. Reference [2] is an excellent source for more information about artificial neural networks. |
February 25 | HW 3 due at 11pm | Submit it through Blackboard. |
March 1 | No class (Spring Break) | |
March 8 | Classification and Prediction | Read relevant sections in chapter 6. If you are interested in more technical information about SVMs, take a look at SVMoverview.pdf (uploaded on Blackboard). Depending on your math background, some sections might be difficult to understand in detail, but the general idea will be clear. There is no homework this week other than studying for the midterm. |
March 15 | Midterm exam. | Same start time and location as class. |
March 22 | Classification and Prediction; Frequent Patterns | Read relevant sections in chapters 6 and 5. Slides 230-236 and the discussion about Groves represent advanced material for those interested in learning more, but are not required reading for this class. Similarly, the specific formulas for the boosting algorithm are optional reading, but you need to know the basic functionality (a minimal boosting sketch follows the schedule table below). |
March 29 | Frequent Patterns | Read relevant sections in chapter 5. Go over the examples for Apriori and FP-growth (look at the textbook for more details) and make sure you can run the algorithms manually on small examples (an Apriori sketch follows the schedule table below). Compare how FP-growth explores the itemset lattice differently than Apriori. |
April 5 | Frequent Patterns | Read relevant sections in chapter 5. For the example on slide 48 (better: create your own small example), find the maximal and closed frequent itemsets for min_sup=3 and min_sup=1. Practice the computation of lift and discuss why support and confidence might not be good enough in practice (a sketch for lift and for closed/maximal itemsets follows the schedule table below). Explain in which order GSP and PrefixSpan explore possible sequences. How do these algorithms differ in the way they are pruning the search space? (Hint: Use the tree of sequences as presented in class, which is similar to the itemset lattice and has 1-item sequences in the first level, 2-item sequences in the second, and so on.) What are the main similarities and differences between Apriori and GSP? What are the main similarities and differences between FP-Growth and PrefixSpan? |
April 12 | Clustering | Read relevant sections in chapter 7. For additional information, look at reference [1]. |
April 18 | Project pre-submission due at 11pm | Submit it through Blackboard. |
April 19 | Review, Project discussion, Data Warehousing and OLAP overview | |
April 22 | Project final submission due at 11pm | Submit all files through Blackboard. |
April 26 | Final exam | 6-8pm in the usual classroom. |
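For the February 15 Naive Bayes exercise, the following minimal Python sketch shows the mechanics of the computation: estimate the class prior and the per-attribute likelihoods from counts, then multiply. The tiny training table and query record are hypothetical stand-ins for the example on slide 110, which is not reproduced here.

```python
# Naive Bayes by hand: score(C) = P(C) * prod_i P(x_i | C), with all
# probabilities estimated from counts in a small training table.
from collections import Counter

# Hypothetical training records: (attribute dict, class label).
data = [
    ({"outlook": "sunny",    "windy": "true"},  "no"),
    ({"outlook": "sunny",    "windy": "false"}, "no"),
    ({"outlook": "rainy",    "windy": "false"}, "yes"),
    ({"outlook": "rainy",    "windy": "true"},  "no"),
    ({"outlook": "overcast", "windy": "true"},  "yes"),
    ({"outlook": "overcast", "windy": "false"}, "yes"),
]

def naive_bayes_scores(data, query):
    class_counts = Counter(label for _, label in data)
    scores = {}
    for c, nc in class_counts.items():
        score = nc / len(data)              # prior P(C)
        for attr, value in query.items():   # likelihood P(x_i | C)
            match = sum(1 for rec, label in data
                        if label == c and rec[attr] == value)
            score *= match / nc
        scores[c] = score                   # unnormalized P(C | x)
    return scores

query = {"outlook": "sunny", "windy": "false"}
scores = naive_bayes_scores(data, query)
print(scores, "->", max(scores, key=scores.get))
```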
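For the boosting material from March 22, this sketch illustrates only the basic functionality, not any particular slide's formulas: reweight the training examples after each round so the next weak learner focuses on the current mistakes, then combine all weak learners by weighted vote. The 1-D data set and the threshold stumps are hypothetical.

```python
# AdaBoost-style boosting sketch on a hypothetical 1-D data set.
import math

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical inputs
y = [+1, +1, -1, -1, +1, -1]          # labels in {-1, +1}

def stump(theta, sign):
    # Weak learner: predict `sign` if x < theta, else -sign.
    return lambda x: sign if x < theta else -sign

candidates = [stump(t, s)
              for t in (1.5, 2.5, 3.5, 4.5, 5.5) for s in (+1, -1)]

def weighted_error(h, w):
    return sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)

w = [1.0 / len(X)] * len(X)   # start with uniform example weights
ensemble = []                 # list of (alpha, weak learner) pairs

for _ in range(3):            # T = 3 boosting rounds
    h = min(candidates, key=lambda g: weighted_error(g, w))
    err = weighted_error(h, w)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
    ensemble.append((alpha, h))
    # Increase weights of misclassified examples, then renormalize.
    w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
    total = sum(w)
    w = [wi / total for wi in w]

def predict(x):
    # Weighted vote of all weak learners in the ensemble.
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

print([predict(x) for x in X], "vs", y)
```

No single stump classifies this data correctly, but the three-round weighted vote does, which is exactly the behavior the lecture's discussion of boosting describes.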
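For the March 29 exercise, here is a minimal Apriori sketch, assuming a small hypothetical transaction database; after tracing it by hand, compare its level-wise candidate generation with how FP-growth explores the same itemset lattice.

```python
# Minimal Apriori: level-wise candidate generation with
# subset-based pruning, then support counting.
from itertools import combinations

transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]  # hypothetical example data
min_sup = 3  # absolute support threshold

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]

frequent = list(level)
k = 2
while level:
    # Join step: unions of frequent (k-1)-itemsets that have size k.
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # Prune step: every (k-1)-subset must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(level)
                         for s in combinations(c, k - 1))}
    level = [c for c in candidates if support(c) >= min_sup]
    frequent += level
    k += 1

for itemset in frequent:
    print(sorted(itemset), support(itemset))
```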
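For the April 5 exercises, this sketch computes the lift of a rule and finds closed and maximal frequent itemsets by brute force. The transactions and the min_sup threshold are hypothetical; substitute the example from slide 48 (or your own) and the min_sup values from the exercise.

```python
# Lift of a rule, plus brute-force closed/maximal frequent itemsets.
from itertools import combinations

transactions = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]  # hypothetical example data
n = len(transactions)

def sup(s):
    # Relative support: fraction of transactions containing s.
    return sum(1 for t in transactions if s <= t) / n

def lift(lhs, rhs):
    # lift(X => Y) = sup(X u Y) / (sup(X) * sup(Y)); a value near 1
    # means X and Y look independent, which high support and high
    # confidence alone can hide.
    return sup(lhs | rhs) / (sup(lhs) * sup(rhs))

items = sorted({i for t in transactions for i in t})
min_sup = 2 / n
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(frozenset(c)) >= min_sup]

# Closed: no frequent proper superset has the same support.
closed = [f for f in frequent
          if not any(f < g and sup(g) == sup(f) for g in frequent)]
# Maximal: no frequent proper superset at all.
maximal = [f for f in frequent if not any(f < g for g in frequent)]

print("lift(a => b) =", lift({"a"}, {"b"}))
print("closed:", [sorted(f) for f in closed])
print("maximal:", [sorted(f) for f in maximal])
```

Note how the rule a => b in this data has confidence 0.75 yet lift below 1, illustrating why support and confidence alone may mislead.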
Team | Pre-Submission Accuracy (%) | Team | Final Accuracy (%) on testset1 (and testset0) |
---|---|---|---|
8 | 79.521 | 1 | 81.242 (81.13) |
2 | 78.927 | 2 | 80.530 (78.93) |
3 | 78.792 | 5 | 80.046 (79.95) |
1 | 77.667 | 8 | 80.033 (80.01) |
5 | 76.676 | 6 | 80.021 (80.01) |
7 | 76.280 | 3 | 79.923 (79.94) |
4 | 74.945 | 7 | 79.741 (79.72) |
6 | 74.512 | 4 | 79.604 (79.57) |
Team | Latest Accuracy (%) |
---|---|
1 | 80.313 |
6 | 80.006 |
8 | 79.521 |
2 | 78.927 |
3 | 78.792 |
7 | 78.162 |
5 | 78.143 |
4 | 74.945 |
Instructor: Mirek Riedewald
TA: We have no TA this semester :-(
Meeting times: Tue 6 - 9 PM
Meeting location: 108 WVH
Prerequisites: CS 5800 or CS 7800, or consent of instructor
Textbook: Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition. Morgan Kaufmann, 2006.
Recommended books for further reading:
A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.
For more information, please refer to the Academic Integrity Web page.