This course covers techniques for analyzing very large data sets. We introduce the MapReduce programming model and the core technologies it relies on in practice, such as a distributed file system. Related approaches and technologies from distributed databases and cloud computing are also introduced. Particular emphasis is placed on practical examples and hands-on programming experience. We use both plain MapReduce and database-inspired advanced programming models that run on top of a MapReduce infrastructure.
Link to Piazza discussion forum: https://piazza.com/northeastern/spring2014/cs6240/home
Acknowledgment: This course was kindly supported by an AWS in Education Grant award from Amazon.com, Inc.
[04/11/2014] Reminder: no regular class on 4/15. Instead we have a double class on 4/22, from 11:30am until 4:35pm in our regular lecture hall.
[04/11/2014] All slides and audio of all lectures are now on Blackboard.
(Future lectures and events are tentative.)
Date | Topic | Remarks and Reading Assignments |
Jan 7 | Syllabus and overview; introduction; simple algorithms; measures of success; Amdahl's Law | Read more about data centers and "data center as a computer" here. |
Jan 14 | MapReduce overview: distributed file system, Word Count, anatomy of a MapReduce execution, partitioner, failure handling, Hadoop specifics | Read the Google File System paper. Read the Google MapReduce paper. Look carefully at the word count example and make sure you can explain how the computation works. For a detailed discussion, consult the relevant chapters in White's book. For a more compact discussion, consult the Lin/Dyer book. |
Jan 21 | Fundamental techniques: combiner and in-mapper combining, sorting, secondary sort | Make sure you can explain in detail how the sorting algorithm works. For a detailed discussion about sorting, consult the relevant chapters in White's book. For in-mapper combining and secondary sort, consult the Lin/Dyer book. |
Jan 28 | Algorithm examples and helper functions (order inversion, sampling, quantiles etc.) | Consult the Miner/Shook book about some of the helper functions discussed. |
Feb 4 | More algorithm examples (equi-join); Pig and PigLatin | Consult the Miner/Shook book about some of the algorithms discussed. Consult the Lin/Dyer and the Miner/Shook books about the design patterns. Read the following chapter in White's book: 11. Pig. |
Feb 11 | Relational databases; CAP; HBase; Hive | To learn more about relational databases in a distributed context, take a look at the appropriate chapters in M. Tamer Özsu and Patrick Valduriez, Principles of Distributed Database Systems, third edition, Springer, 2011. Read the following chapters in White's book: 12. Hive, 13. HBase. For more details about HBase, consult the George book. |
Feb 18 | Graph algorithms | Read the appropriate sections in the Lin/Dyer book. Create a small example graph and manually run the MapReduce programs on the example to better understand what happens in each iteration. Read more about PageRank here. |
Feb 25 | Intelligent partitioning: Pairs and Stripes, theta-joins | Read more about Pairs and Stripes in the Lin/Dyer book. The theta-join technique is discussed in our research paper. |
Mar 4 | No class: Spring Break | |
Mar 11 | Midterm exam | Same time and location as lecture. |
Mar 18 | Data mining in MapReduce (clustering, classification) | For more information about data mining, check out my CS 6220 page. There are slides summarizing various mainstream data mining approaches and a list of recommended textbooks. |
Mar 25 | Data mining in MapReduce (ensemble methods, regression, matrix manipulation for machine learning) | For more information about machine learning techniques that rely on matrix manipulations, read this paper. |
Apr 1 | Testing, tuning, and analysis; case studies: search log analysis, HBase for indexing/sorting | Read more about testing and tuning in White's book. |
Apr 8 | Classic view of parallel computing vs. MapReduce | Take a look at the parallel computing tutorial by LLNL. There are similar tutorials about MPI and OpenMP. |
Apr 15 | No class: Moved to Apr 22 | |
Apr 22 | Project presentations | Double class: 11:30am to 4:35pm |
Instructor: Mirek Riedewald
TAs:
Meeting times: Tue 1:35 - 4:35 PM
Meeting location: check registrar system for up-to-date info
Prerequisites: CS 5800 or CS 7800, or consent of instructor
Safari Books Online at NEU: http://proquest.safaribooksonline.com.ezproxy.neu.edu/ (might have changed in the meantime)
A commitment to the principles of academic integrity is essential to the mission of Northeastern University. The promotion of independent and original scholarship ensures that students derive the most from their educational experience and their pursuit of knowledge. Academic dishonesty violates the most fundamental values of an intellectual community and undermines the achievements of the entire University.
For more information, please refer to the Academic Integrity Web page.