This course covers various aspects of data mining including data preprocessing, classification, ensemble methods, association rules, sequence mining, and cluster analysis. The class project involves hands-on practice of mining useful knowledge from a large data set.
[04/27/2011] Final results of data mining competition posted
[04/13/2011] Slides from April 12 lecture posted
[04/07/2011] Slides from April 5 lecture posted
[04/04/2011] Final HW available on Blackboard
[03/31/2011] Slides from Mar 29 lecture posted
[03/23/2011] Slides from Mar 22 lecture posted
[03/09/2011] Slides from Mar 8 lecture posted
[02/23/2011] Slides from Feb 22 lecture posted
[02/18/2011] HW 3 available on Blackboard
[02/16/2011] Slides from Feb 15 lecture posted; a larger version of the slides (2 per page) is also available
(Future lectures and events are tentative.)
Date | Topic | Remarks and Homework |
---|---|---|
January 11 | Introduction; Data Preprocessing | Read chapters 1 and 2 in the book. |
January 18 | Data Preprocessing; Classification and Prediction | Read relevant sections in chapter 2. |
January 25 | Classification and Prediction | Read relevant sections in chapter 6. |
January 26 | HW 1 due at 11pm | Submit it through Blackboard. |
February 1 | No class due to inclement weather. | The university canceled all classes after 4pm. |
February 2 | HW 2 due at 11pm | Submit it through Blackboard. |
February 8 | Classification and Prediction | Read relevant sections in chapter 6. For more information, also look at references [1] for trees and [5] for statistical decision theory (see below). |
February 15 | Classification and Prediction | Read relevant sections in chapter 6. For more information about the bias-variance tradeoff, look at Geman92.pdf (uploaded on Blackboard). Optional HW: Go over the Naive Bayes computation example on slide 110 and make sure you can do this on your own for any given input record (a small computational sketch follows the schedule table below). |
February 22 | Classification and Prediction | Read relevant sections in chapter 6. Reference [2] is an excellent source for more information about artificial neural networks. |
February 25 | HW 3 due at 11pm | Submit it through Blackboard. |
March 1 | No class (Spring Break) | |
March 8 | Classification and Prediction | Read relevant sections in chapter 6. If you are interested in more technical information about SVMs, take a look at SVMoverview.pdf (uploaded on Blackboard). Depending on your math background, some sections might be difficult to understand in detail, but the general idea will be clear. There is no homework this week other than studying for the midterm. |
March 15 | Midterm exam. | Same start time and location as class. |
March 22 | Classification and Prediction; Frequent Patterns | Read relevant sections in chapters 6 and 5. Slides 230-236 and the discussion about Groves represent advanced material for those interested in learning more, but are not required reading for this class. Similarly, the specific formulas for the boosting algorithm are optional reading, but you need to know the basic functionality (a minimal boosting sketch follows the schedule table below). |
March 29 | Frequent Patterns | Read relevant sections in chapter 5. Go over the examples for Apriori and FP-growth (look at the textbook for more details) and make sure you can run the algorithms manually on small examples (an Apriori sketch follows the schedule table below). Compare how FP-growth explores the itemset lattice differently than Apriori. |
April 5 | Frequent Patterns | Read relevant sections in chapter 5. For the example on slide 48 (better: create your own small example), find the maximal and closed frequent itemsets for min_sup=3 and min_sup=1. Practice the computation of lift and discuss why support and confidence might not be good enough in practice (a sketch for lift and for closed/maximal itemsets follows the schedule table below). Explain in which order GSP and PrefixSpan explore possible sequences. How do these algorithms differ in the way they are pruning the search space? (Hint: Use the tree of sequences as presented in class, which is similar to the itemset lattice and has 1-item sequences in the first level, 2-item sequences in the second, and so on.) What are the main similarities and differences between Apriori and GSP? What are the main similarities and differences between FP-Growth and PrefixSpan? |
April 12 | Clustering | Read relevant sections in chapter 7. For additional information, look at reference [1]. |
April 18 | Project pre-submission due at 11pm | Submit it through Blackboard. |
April 19 | Review, Project discussion, Data Warehousing and OLAP overview | |
April 22 | Project final submission due at 11pm | Submit all files through Blackboard. |
April 26 | Final exam | 6-8pm in the usual classroom. |
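For the February 15 Naive Bayes exercise, the following minimal Python sketch shows the mechanics of the computation: estimate the class prior and the per-attribute likelihoods from counts, then multiply. The tiny training table and query record are hypothetical stand-ins for the example on slide 110, which is not reproduced here.

```python
# Naive Bayes by hand: score(C) = P(C) * prod_i P(x_i | C), with all
# probabilities estimated from counts in a small training table.
from collections import Counter

# Hypothetical training records: (attribute dict, class label).
data = [
    ({"outlook": "sunny",    "windy": "true"},  "no"),
    ({"outlook": "sunny",    "windy": "false"}, "no"),
    ({"outlook": "rainy",    "windy": "false"}, "yes"),
    ({"outlook": "rainy",    "windy": "true"},  "no"),
    ({"outlook": "overcast", "windy": "true"},  "yes"),
    ({"outlook": "overcast", "windy": "false"}, "yes"),
]

def naive_bayes_scores(data, query):
    class_counts = Counter(label for _, label in data)
    scores = {}
    for c, nc in class_counts.items():
        score = nc / len(data)              # prior P(C)
        for attr, value in query.items():   # likelihood P(x_i | C)
            match = sum(1 for rec, label in data
                        if label == c and rec[attr] == value)
            score *= match / nc
        scores[c] = score                   # unnormalized P(C | x)
    return scores

query = {"outlook": "sunny", "windy": "false"}
scores = naive_bayes_scores(data, query)
print(scores, "->", max(scores, key=scores.get))
```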
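For the boosting material from March 22, this sketch illustrates only the basic functionality, not any particular slide's formulas: reweight the training examples after each round so the next weak learner focuses on the current mistakes, then combine all weak learners by weighted vote. The 1-D data set and the threshold stumps are hypothetical.

```python
# AdaBoost-style boosting sketch on a hypothetical 1-D data set.
import math

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical inputs
y = [+1, +1, -1, -1, +1, -1]          # labels in {-1, +1}

def stump(theta, sign):
    # Weak learner: predict `sign` if x < theta, else -sign.
    return lambda x: sign if x < theta else -sign

candidates = [stump(t, s)
              for t in (1.5, 2.5, 3.5, 4.5, 5.5) for s in (+1, -1)]

def weighted_error(h, w):
    return sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)

w = [1.0 / len(X)] * len(X)   # start with uniform example weights
ensemble = []                 # list of (alpha, weak learner) pairs

for _ in range(3):            # T = 3 boosting rounds
    h = min(candidates, key=lambda g: weighted_error(g, w))
    err = weighted_error(h, w)
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
    ensemble.append((alpha, h))
    # Increase weights of misclassified examples, then renormalize.
    w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
    total = sum(w)
    w = [wi / total for wi in w]

def predict(x):
    # Weighted vote of all weak learners in the ensemble.
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

print([predict(x) for x in X], "vs", y)
```

No single stump classifies this data correctly, but the three-round weighted vote does, which is exactly the behavior the lecture's discussion of boosting describes.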
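For the March 29 exercise, here is a minimal Apriori sketch, assuming a small hypothetical transaction database; after tracing it by hand, compare its level-wise candidate generation with how FP-growth explores the same itemset lattice.

```python
# Minimal Apriori: level-wise candidate generation with
# subset-based pruning, then support counting.
from itertools import combinations

transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c", "d"},
]  # hypothetical example data
min_sup = 3  # absolute support threshold

def support(itemset):
    return sum(1 for t in transactions if itemset <= t)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]

frequent = list(level)
k = 2
while level:
    # Join step: unions of frequent (k-1)-itemsets that have size k.
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # Prune step: every (k-1)-subset must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in set(level)
                         for s in combinations(c, k - 1))}
    level = [c for c in candidates if support(c) >= min_sup]
    frequent += level
    k += 1

for itemset in frequent:
    print(sorted(itemset), support(itemset))
```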
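For the April 5 exercises, this sketch computes the lift of a rule and finds closed and maximal frequent itemsets by brute force. The transactions and the min_sup threshold are hypothetical; substitute the example from slide 48 (or your own) and the min_sup values from the exercise.

```python
# Lift of a rule, plus brute-force closed/maximal frequent itemsets.
from itertools import combinations

transactions = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]  # hypothetical example data
n = len(transactions)

def sup(s):
    # Relative support: fraction of transactions containing s.
    return sum(1 for t in transactions if s <= t) / n

def lift(lhs, rhs):
    # lift(X => Y) = sup(X u Y) / (sup(X) * sup(Y)); a value near 1
    # means X and Y look independent, which high support and high
    # confidence alone can hide.
    return sup(lhs | rhs) / (sup(lhs) * sup(rhs))

items = sorted({i for t in transactions for i in t})
min_sup = 2 / n
frequent = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if sup(frozenset(c)) >= min_sup]

# Closed: no frequent proper superset has the same support.
closed = [f for f in frequent
          if not any(f < g and sup(g) == sup(f) for g in frequent)]
# Maximal: no frequent proper superset at all.
maximal = [f for f in frequent if not any(f < g for g in frequent)]

print("lift(a => b) =", lift({"a"}, {"b"}))
print("closed:", [sorted(f) for f in closed])
print("maximal:", [sorted(f) for f in maximal])
```

Note how the rule a => b in this data has confidence 0.75 yet lift below 1, illustrating why support and confidence alone may mislead.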
Team | Pre-Submission Accuracy (%) | Team | Final Accuracy (%) on testset1 (and testset0) |
---|---|---|---|
8 | 79.521 | 1 | 81.242 (81.13) |
2 | 78.927 | 2 | 80.530 (78.93) |
3 | 78.792 | 5 | 80.046 (79.95) |
1 | 77.667 | 8 | 80.033 (80.01) |
5 | 76.676 | 6 | 80.021 (80.01) |
7 | 76.280 | 3 | 79.923 (79.94) |
4 | 74.945 | 7 | 79.741 (79.72) |
6 | 74.512 | 4 | 79.604 (79.57) |
Team | Latest Accuracy (%) |
---|---|
1 | 80.313 |
6 | 80.006 |
8 | 79.521 |
2 | 78.927 |
3 | 78.792 |
7 | 78.162 |
5 | 78.143 |
4 | 74.945 |
Instructor: Mirek Riedewald
TA: We have no TA this semester :-(
Meeting times: Tue 6 - 9 PM
Meeting location: 108 WVH
Prerequisites: CS 5800 or CS 7800, or consent of instructor
Textbook: Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques, 2nd edition. Morgan Kaufmann, 2006.
Recommended books for further reading:
A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.
For more information, please refer to the Academic Integrity Web page.