This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and Cloud Computing will also be introduced. Particular emphasis is placed on practical examples and hands-on programming experience. Both plain MapReduce and database-inspired advanced programming models running on top of a MapReduce infrastructure will be used.
Link to Piazza discussion forum: https://piazza.com/northeastern/spring2013/cs6240/home
Acknowledgment: This course was kindly supported by an AWS in Education Coursework Grant award from Amazon.com, Inc.
[04/08/2013] All lecture audio and slides up to Apr 2 lecture on Blackboard
(Future lectures and events are tentative.)
Date | Topic | Remarks and Reading Assignments |
Jan 8 | Introduction, simple algorithms, measures of success | |
Jan 15 | Performance measures, Amdahl's law, MapReduce overview, word count, combiner, partitioner | Read the Google MapReduce paper. Look carefully at the word count example and make sure you can explain how the computation works. |
Jan 22 | Failure handling, equi-join, reverse Web graph, inverted index, anatomy of a MapReduce execution, Google File System | Read the relevant chapters in White's book: 1. Meet
Hadoop, 2. MapReduce, 3. The Hadoop Distributed File System, 4. Hadoop
I/O. Read the Google File System paper. |
Jan 29 | Hadoop specifics, sorting, in-mapper combining, Pairs and Stripes | Read the relevant chapters in White's book: 5.
Developing a M-R Application, 6. How M-R Works, 7. M-R Types and
Formats, 8. M-R Features. Consult the Lin/Dyer and the Miner/Shook books about the design patterns. |
Feb 5 | Relative frequencies through order inversion, secondary sort through value-to-key conversion, Pig and PigLatin | Read the relevant chapters in White's book: 11. Pig. Consult the Lin/Dyer and the Miner/Shook books about the design patterns. |
Feb 12 | Relational databases, HBase, Hive | Read the relevant chapters in White's book: 12. Hive, 13. HBase. For more details about HBase, consult the George book. |
Feb 19 | HW 2 discussion, Graph algorithms: single source shortest path, PageRank introduction | Read the appropriate sections in the Lin/Dyer book (see below). Create a small example graph and manually run the MapReduce programs on the example to better understand what happens in each iteration. |
Feb 26 | PageRank, data mining in MapReduce: sampling, clustering | Read more about PageRank here. For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks. |
Mar 5 | No class: Spring Break | |
Mar 12 | Data mining in MapReduce: classification and prediction | For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks. |
Mar 19 | Midterm exam | Same time and location as lecture. |
Mar 26 | Discussion of midterm solutions; ensemble predictions and how to cover all combinations; matrix multiplication and machine learning | For more information about machine learning techniques that rely of matrix manipulations read this paper. |
Apr 2 | Testing and tuning; theta joins in MapReduce | Read more about testing and tuning in White's book. The theta-join technique is discussed in our paper. |
Apr 9 | Project progress presentations | |
Apr 16 | ||
Apr 23 | Project presentations |
Instructor: Mirek Riedewald
TA: Moonyoung Kang
Meeting times: Tue 6 - 9 PM
Meeting location: 210 Shillman Hall
CS 5800 or CS 7800, or consent of instructor
A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.
For more information, please refer to the Academic Integrity Web page.