CS6240: Parallel Data Processing in MapReduce

This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and Cloud Computing will also be introduced. Particular emphasis is placed on practical examples and hands-on programming experience. Both plain MapReduce and database-inspired advanced programming models running on top of a MapReduce infrastructure will be used.

News

Link to Piazza discussion forum: https://piazza.com/northeastern/spring2013/cs6240/home

Acknowledgment: This course was kindly supported by an AWS in Education Coursework Grant award from Amazon.com, Inc.

[04/08/2013] All lecture audio and slides up to the Apr 2 lecture are on Blackboard.

Lectures

(Future lectures and events are tentative.)

Date	Topic	Remarks and Reading Assignments
Jan 8	Introduction, simple algorithms, measures of success
Jan 15	Performance measures, Amdahl's law, MapReduce overview, word count, combiner, partitioner	Read the Google MapReduce paper. Look carefully at the word count example and make sure you can explain how the computation works.
Jan 22	Failure handling, equi-join, reverse Web graph, inverted index, anatomy of a MapReduce execution, Google File System	Read the relevant chapters in White's book: 1. Meet Hadoop, 2. MapReduce, 3. The Hadoop Distributed File System, 4. Hadoop I/O. Read the Google File System paper.
Jan 29	Hadoop specifics, sorting, in-mapper combining, Pairs and Stripes	Read the relevant chapters in White's book: 5. Developing a M-R Application, 6. How M-R Works, 7. M-R Types and Formats, 8. M-R Features. Consult the Lin/Dyer and the Miner/Shook books about the design patterns.
Feb 5	Relative frequencies through order inversion, secondary sort through value-to-key conversion, Pig and PigLatin	Read the relevant chapters in White's book: 11. Pig. Consult the Lin/Dyer and the Miner/Shook books about the design patterns.
Feb 12	Relational databases, HBase, Hive	Read the relevant chapters in White's book: 12. Hive, 13. HBase. For more details about HBase, consult the George book.
Feb 19	HW 2 discussion, Graph algorithms: single source shortest path, PageRank introduction	Read the appropriate sections in the Lin/Dyer book (see below). Create a small example graph and manually run the MapReduce programs on the example to better understand what happens in each iteration.
Feb 26	PageRank, data mining in MapReduce: sampling, clustering	Read more about PageRank here. For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks.
Mar 5	No class: Spring Break
Mar 12	Data mining in MapReduce: classification and prediction	For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks.
Mar 19	Midterm exam	Same time and location as lecture.
Mar 26	Discussion of midterm solutions; ensemble predictions and how to cover all combinations; matrix multiplication and machine learning	For more information about machine learning techniques that rely on matrix manipulations, read this paper.
Apr 2	Testing and tuning; theta joins in MapReduce	Read more about testing and tuning in White's book. The theta-join technique is discussed in our paper.
Apr 9	Project progress presentations
Apr 16
Apr 23	Project presentations
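
As a companion to the Jan 15 reading assignment, the following is a minimal sketch of how the word count computation flows through the map, combine, shuffle, and reduce steps. It is a single-process simulation in plain Python, not Hadoop code: the function names (map_phase, combine, shuffle, reduce_phase, word_count) are hypothetical and chosen for illustration only, and whitespace tokenization with lowercasing is an assumption rather than part of the paper's example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def combine(pairs):
    """Combiner: pre-aggregate counts within a single mapper's output."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(mapper_outputs):
    """Shuffle: group the values of each key across all mappers."""
    groups = defaultdict(list)
    for pairs in mapper_outputs:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts received for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run the simulated job, treating each document as one mapper's input."""
    mapper_outputs = [combine(map_phase(doc)) for doc in documents]
    groups = shuffle(mapper_outputs)
    return dict(reduce_phase(word, counts) for word, counts in groups.items())


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(docs))
```

Because addition is associative and commutative, the combiner can reuse the reducer's logic here; it only reduces the number of pairs passed to the shuffle and does not change the final counts.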

Course Information

Instructor: Mirek Riedewald

Office hours: Tuesday 4-5:30pm in 332 WVH
If you cannot make it during these times, send an email (copying the TA) to set up an appointment.

TA: Moonyoung Kang

Office hours: Friday 1-4pm in 472 WVH

Meeting times: Tue 6 - 9 PM
Meeting location: 210 Shillman Hall

Prerequisites

CS 5800 or CS 7800, or consent of instructor

Grading

Homework/project: 50%
Midterm exam: 30%
Participation: 10%
Review quizzes: 10%

Reading Materials

"Hadoop: The Definitive Guide" by Tom White, 3rd edition. (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
"MapReduce Design Patterns" by Donald Miner and Adam Shook (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
"Hadoop in Practice" by Alex Holmes (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
"Hadoop in Action" by Chuck Lam (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
"Data-Intensive Text Processing with MapReduce" by Jimmy Lin and Chris Dyer. (Available online, see http://www.umiacs.umd.edu/~jimmylin/book.html for info.)
"HBase: The Definitive Guide" by Lars George. (Available from Safari Books Online at http://0-proquest.safaribooksonline.com.ilsprod.lib.neu.edu/.)
Check out Yahoo!'s Hadoop tutorial for additional information. Note that it uses the old MapReduce API.

Academic Integrity Policy

A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.

For more information, please refer to the Academic Integrity Web page.