This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. We also introduce related approaches and technologies from distributed databases and cloud computing. Particular emphasis is placed on practical examples and hands-on programming experience. We will use both plain MapReduce and database-inspired advanced programming models that run on top of a MapReduce infrastructure.
Link to Piazza discussion forum: https://piazza.com/northeastern/fall2012/cs6240/home
Acknowledgment: This course was kindly supported by an AWS in Education Coursework Grant award from Amazon.com, Inc.
[12/11/2012] Lecture audio updated
(Future lectures and events are tentative.)
Date | Topic | Remarks and Reading Assignments |
Sep 11 | Introduction, simple algorithms, measures of success | |
Sep 18 | MapReduce, word count, equi-join, handling failures | Read the Google MapReduce paper. Look carefully at the word count and equi-join examples and make sure you can explain how the computation works. |
Sep 25 | Reverse Web graph, inverted index, sorting, Google File System | Read the relevant chapters in White's book. Read the Google File System paper. |
Oct 2 | Hadoop specifics, MapReduce Design Patterns | Read the relevant chapters in White's book. Read the appropriate sections in the Lin/Dyer book (see below). Try to re-write the word count example so that it uses the Local Aggregation design pattern. |
Oct 9 | Design Patterns | Read the appropriate sections in the Lin/Dyer book (see below). Go through the Order Inversion design pattern in detail by using an example like the relative bird color counts we discussed in class. |
Oct 16 | Design Patterns, Theta-Joins in MapReduce | Read the appropriate sections in the Lin/Dyer book (see below). For the joins, take a look at our paper. |
Oct 23 | Graph Algorithms | Read the appropriate sections in the Lin/Dyer book (see below). Create a small example graph and manually run the MapReduce programs on the example to better understand what happens in each iteration. |
Oct 30 | Graph Algorithms; Pig | Read the Pig paper and the corresponding chapter in the Tom White book. |
Nov 6 | Midterm Exam | Same time and location as lecture. |
Nov 13 | HW 2 discussion; Databases | |
Nov 20 | Project and midterm discussion; Databases, HBase, and Hive; Reducing Map-to-Reduce data transfer | Read more about HBase and Hive in the books by Tom White and Lars George (see below). |
Nov 27 | Project progress presentations | |
Dec 4 | MapReduce for Machine Learning; Parallel Computing Landscape | Read more about MapReduce for machine learning in this paper. LLNL has a good overview of high-performance computing and MPI. |
Dec 11 | Final project presentations | |
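The Sep 18 and Oct 2 reading assignments ask you to explain how word count works and to rewrite it with the Local Aggregation design pattern. The sketch below is one way to trace that computation on a single machine in plain Python. It is not Hadoop code and not the version shown in class: the function names (`map_word_count`, `shuffle`, `reduce_word_count`, `run_word_count`) are hypothetical, and the shuffle step is simulated with an in-memory dictionary on the assumption that all data fits in memory.

```python
from collections import Counter, defaultdict


def map_word_count(document):
    """Mapper with local aggregation.

    Instead of emitting (word, 1) for every occurrence, count words
    inside the mapper and emit one (word, count) pair per distinct word.
    """
    local_counts = Counter(document.lower().split())
    return list(local_counts.items())


def shuffle(pairs):
    """Simulated shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_word_count(word, counts):
    """Reducer: sum the partial counts received for one word."""
    return word, sum(counts)


def run_word_count(documents):
    """Run the map, shuffle, and reduce phases over a list of documents."""
    intermediate = []
    for document in documents:  # one simulated map task per document
        intermediate.extend(map_word_count(document))
    grouped = shuffle(intermediate)
    return dict(reduce_word_count(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(run_word_count(docs))
```

Without local aggregation, the mapper would emit one pair per word occurrence, and the reducer would produce the same totals. The difference is the number of intermediate pairs that cross the network between the map and reduce phases, which is what the design pattern aims to reduce.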
Instructor: Mirek Riedewald
TA: Alper Okcan
Meeting times: Tue 6 - 9 PM
Meeting location: 425 Shillman Hall
Prerequisites: CS 5800 or CS 7800, or consent of instructor
A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.
For more information, please refer to the Academic Integrity Web page.