This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and cloud computing are covered as well. Particular emphasis is placed on practical examples and hands-on programming experience. We will use both plain MapReduce and database-inspired higher-level programming models that run on top of a MapReduce infrastructure.
[12/01/2011] Materials from Nov 30 lecture posted
[11/10/2011] Materials from Nov 9 lecture posted
[11/03/2011] Slides and audio from Nov 2 lecture posted
[10/20/2011] Audio from Oct 19 lecture posted on Blackboard
(Future lectures and events are tentative.)
Date | Topic | Remarks and Reading Assignments |
September 7 | Introduction and first parallel algorithms | |
September 14 | More parallel algorithms; MapReduce | Read the Google MapReduce paper. Look carefully at the word count and equi-join examples and make sure you can explain how the computation works; minimal sketches of both appear below the schedule. |
September 21 | MapReduce algorithm examples; handling failures | Go over all the MapReduce algorithms we discussed in class and read the discussion about sorting in chapter 8 of the Tom White book. Then try to write the pseudo-code for Map and Reduce for all problems without looking at the lecture notes. Finally, look at the Grep.java and Sort.java examples that come with the Hadoop distribution. Match the Java code with your pseudo-code and execute it for some example data. |
September 26 | HW 1 due at 11pm | Submit it through Blackboard. |
September 28 | MapReduce; Google File System; Hadoop specifics | Read the Google File System paper. Read chapters 1, 2, 3, 4, 5, 6, and 7 in the Tom White book. |
October 5 | Pig; MapReduce design patterns | Read the Pig paper. Read chapters 8 and 11 in the Tom White book. |
October 12 | MapReduce design patterns; joins | Read the appropriate sections in the Lin/Dyer book (see below). For the joins, take a look at our paper. |
October 19 | Joins | Read our paper. The reduce-side join sketch below the schedule shows the basic idea. |
October 20 | HW 2 due at 11pm | |
October 26 | Midterm exam (6-8pm in usual classroom) | |
November 2 | Graph algorithms | Read the appropriate sections in the Lin/Dyer book (see below). Create a small example graph and manually run the MapReduce program on the example to better understand what happens in each iteration; the parallel BFS sketch below the schedule is a good starting point. |
November 3 | Project proposals due at 11pm | |
November 9 | Dryad; databases | Read the papers on Dryad, DryadLINQ, and parallel databases. For more information about transactions, consult any standard database textbook. |
November 16 | Project progress presentations in class | |
November 23 | No class: Thanksgiving. | |
November 30 | GPU computing (by Perhaad Mistry); Parallel computing classics | It is important that you go through this excellent tutorial on parallel computing from LLNL. Also make sure you read this overview article on GPU computing. Perhaad's research is discussed in more detail in this paper. And MPI and OpenMP are discussed in two other nice tutorials at LLNL. |
December 2 | Project reports due at 11pm | |
December 7 | Final lecture | Project presentations in class. |
December 14 | Final exam (6-8pm in usual classroom) | |
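To make the word count discussion concrete, here is a minimal sketch of the classic word-count job, written against Hadoop's org.apache.hadoop.mapreduce API in the style of the WordCount example that ships with Hadoop. The class names and the two command-line arguments (input and output directory) are illustrative choices for this sketch, not part of any assignment.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for every word in the input line, emit the pair (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: all counts for the same word arrive together; add them up.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local aggregation; valid because sum is associative and commutative
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The Map function emits a (word, 1) pair for every word, the shuffle phase groups all pairs by word, and the Reduce function sums the counts. Because summation is associative and commutative, the same reducer class can also serve as a combiner for local aggregation on the map side.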
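For the equi-join, the sketch below shows a standard reduce-side join. It is only an illustration, not the algorithm from our paper: it assumes, purely for this sketch, that every input line has the form relation,joinKey,payload where relation is R or S, and the class names are made up.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  // Map: re-key each record by the join attribute, keeping a tag that
  // records which relation it came from.
  public static class TagMapper extends Mapper<Object, Text, Text, Text> {
    private final Text joinKey = new Text();
    private final Text taggedRecord = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed input format: relation,joinKey,payload (one record per line).
      String[] fields = value.toString().split(",", 3);
      joinKey.set(fields[1]);
      taggedRecord.set(fields[0] + "," + fields[2]);
      context.write(joinKey, taggedRecord);
    }
  }

  // Reduce: all R- and S-records with the same join key arrive together;
  // separate them by tag and emit the cross product.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> rRecords = new ArrayList<String>();
      List<String> sRecords = new ArrayList<String>();
      for (Text v : values) {
        String[] parts = v.toString().split(",", 2);
        if (parts[0].equals("R")) {
          rRecords.add(parts[1]);
        } else {
          sRecords.add(parts[1]);
        }
      }
      // Buffers both sides in memory: fine for a toy example, not for large groups.
      for (String r : rRecords) {
        for (String s : sRecords) {
          context.write(key, new Text(r + "\t" + s));
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "reduce-side equi-join");
    job.setJarByClass(ReduceSideJoin.class);
    job.setMapperClass(TagMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // tagged input records
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The mapper re-keys each record by the join attribute, the shuffle brings matching R- and S-records together, and the reducer emits their cross product. Buffering both sides in memory is acceptable for a toy example; scalable implementations typically use a secondary sort so that one input can be streamed.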
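For the graph lecture, the following sketch implements one iteration of parallel breadth-first search in the spirit of the formulation in the Lin/Dyer book. The input format (one line per node: nodeId, current distance from the source or INF, and a comma-separated adjacency list, separated by tabs) and the class names are assumptions made for this sketch; a small driver script would rerun the job, feeding each iteration's output into the next, until no distance changes.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BfsIteration {

  // Map: pass the graph structure along unchanged, and offer distance+1
  // to every neighbor of a node whose distance is already known.
  public static class BfsMapper extends Mapper<Object, Text, LongWritable, Text> {
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Assumed line format: nodeId <TAB> distance-or-INF <TAB> neighbor1,neighbor2,...
      String[] parts = value.toString().split("\t");
      long nodeId = Long.parseLong(parts[0]);
      String dist = parts[1];
      String adjacency = parts.length > 2 ? parts[2] : "";
      context.write(new LongWritable(nodeId), new Text("NODE\t" + dist + "\t" + adjacency));
      if (!dist.equals("INF") && !adjacency.isEmpty()) {
        long d = Long.parseLong(dist);
        for (String neighbor : adjacency.split(",")) {
          context.write(new LongWritable(Long.parseLong(neighbor)), new Text("DIST\t" + (d + 1)));
        }
      }
    }
  }

  // Reduce: for each node, keep the smallest distance seen and re-attach the adjacency list.
  public static class BfsReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    public void reduce(LongWritable nodeId, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      String adjacency = "";
      long best = Long.MAX_VALUE;
      for (Text v : values) {
        String[] parts = v.toString().split("\t");
        if (parts[0].equals("NODE")) {
          adjacency = parts.length > 2 ? parts[2] : "";
          if (!parts[1].equals("INF")) {
            best = Math.min(best, Long.parseLong(parts[1]));
          }
        } else { // a DIST candidate offered by a neighbor
          best = Math.min(best, Long.parseLong(parts[1]));
        }
      }
      String dist = (best == Long.MAX_VALUE) ? "INF" : Long.toString(best);
      // Output uses the same format as the input, so it can feed the next iteration.
      context.write(nodeId, new Text(dist + "\t" + adjacency));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "parallel BFS iteration");
    job.setJarByClass(BfsIteration.class);
    job.setMapperClass(BfsMapper.class);
    job.setReducerClass(BfsReducer.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // graph from the previous iteration
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // graph for the next iteration
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Tracing this by hand on a small example graph, as suggested in the schedule, shows how the set of reachable nodes grows by one hop per iteration.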
Instructor: Mirek Riedewald
TA: none
Meeting times: Wed 6 - 9 PM
Meeting location: Ryder Hall 429
Prerequisites: CS 5800 or CS 7800, or consent of instructor
A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.
For more information, please refer to the Academic Integrity Web page.