This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and Cloud Computing will also be introduced. Particular emphasis is placed on practical examples and hands-on programming experience. Both plain MapReduce and database-inspired advanced programming models running on top of a MapReduce infrastructure will be used.
Acknowledgment: This course was kindly supported by an AWS in Education Grant award from Amazon.com, Inc.
(Future lectures and events are tentative.)
Week starting: | Topic | Remarks |
Sep 1 | Syllabus Overview: data and hardware trends; Cloud computing | |
Sep 8 | Scalability and metrics; Amdahl's Law; Google File System, Hadoop's HDFS | Assignment 1 out. Due Sep 21. |
Sep 15 | MapReduce and Hadoop | |
Sep 22 | Fundamental Techniques (Includes: in-mapper combining, sorting, secondary sorting) | Assignment 2 out. Due Oct 5. |
Sep 29 | Basic Algorithms (Includes: order inversion, per-record computation, group-by, global counters, random sampling and shuffling, quantiles, top-k) | |
Oct 6 | Basic Algorithms: Advanced (Includes: reduce-side join, replicated join, semi-join with Bloom filter) | Assignment 3 out. Due Oct 19. |
Oct 13 | Pig and Pig Latin | |
Oct 20 | Relational Databases | Assignment 4 out. Due Nov 2. |
Oct 27 | CAP theorem; HBase; Hive | Project starts: team forming, proposal. Due Nov 9. |
Nov 3 | Midterm exam | |
Nov 10 | Graph Algorithms (Includes: single source shortest path, PageRank) | Project progress report assignment out. Due Nov 23. |
Nov 17 | Intelligent Partitioning (Includes: Pairs and Stripes, theta-join) | |
Nov 24 | Data Mining 1: clustering, classification | Project final report assignment out. Due Dec 7. Project presentation assignment out. Due Dec 8. |
Dec 1 | Data Mining 2: ensemble methods, regression, matrix manipulation | |
Dec 8 | Project presentations | Same time and location as lecture. |
Instructor: Mirek Riedewald
TAs:
Meeting times and location: check registrar system for up-to-date info
Prerequisites: CS 5800 or CS 7800, or consent of instructor
Safari Books Online at NEU: http://proquest.safaribooksonline.com.ezproxy.neu.edu/ (might have changed in the meantime)
A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.
For more information, please refer to the Academic Integrity Web page.