CS 6240: Large-Scale Parallel Data Processing

I make all my course material available for free. Please use it according to the license (see the license in each module) and acknowledge the source.

The material below is organized by the week in which it will be discussed in the lectures. You are generally expected to have read it (and submitted the self-check quizzes) during the preceding week. This does not apply to week 1, whose material you should read as soon as possible in that week. You are not expected to know everything before the lecture, only to acquire basic familiarity with the main topics. Focus on answering the self-check quizzes yourself. We will then cover the most important material in detail during the lecture.

Week 1

Introduction (read ASAP in week 1)

Parallel Processing Basics (read ASAP in week 1)

Week 2

Introduction to Distributed Services (read by end of week 1)

Distributed File System (read by end of week 1)

Resource and Application Management (read by end of week 1)

Week 3

Overview of MapReduce and Spark (read by end of week 2)

Week 4

Joins (read by end of week 3)

Week 5

Fundamental Techniques (read by end of week 4)

Week 6

Common Algorithm Building Blocks (read by end of week 5)

Week 7

Graph Algorithms (read by end of week 6)

Week 8

Data Mining 1: K-means, decision trees (read by end of week 7)

Week 9

Data Mining 2: Ensembles (read by end of week 8)

Week 10

Intelligent Partitioning (read by end of week 9)

Week 11

More about Spark (read by end of week 10)

Week 12

Exam

Week 13

Beyond MapReduce and Spark: CAP, HBase, and Hive (you may read this after the discussion in class)

Week 14

Varying topics, depending on student interest and progress in earlier weeks. Default: Relevant material from relational (distributed) databases and/or theoretical models for distributed computation.

Week 15

During the last week, you will complete the project presentation.