CSG339: Scalable Techniques for Massive Data

CSG 339: Google, eScience, and the Cloud--Scalable Techniques for Massive Data

The "How much Information?" project at UC Berkeley in 2003 estimated that stored information worldwide grew by about 30% a year between 1999 and 2002. Internet search companies, large online retailers, and social networking sites are looking for new approaches for managing their large data collections. At the same time, data-intensive science is emerging as a new paradigm that is concerned with collecting, archiving, and analyzing the vast amounts of data being produced and accumulated by modern science. This includes gigabytes, even petabytes, of data generated by scientific instruments and sensors, human observers, and computer simulations. Turning scientific raw data into knowledge will be the key for future scientific discoveries.

To turn raw data into knowledge and to make it broadly available, requires innovative approaches to scalable data management and data mining. One major difference compared to previous decades is the end of the era of exponentially increasing processor speed. Hence the current trend is towards multi-core and cluster architectures as the main platforms for scalable data processing.

In this course we will discuss recent papers that introduce innovative solutions coming from both industry and academia. Where appropriate, classic papers and landmark results will be presented to provide the necessary background for the latest research results.

Course Information

Instructor: Mirek Riedewald

Meeting times: Tue, Fri 11:45 - 1:25

Requirements

This course is ideally suited for Ph.D. students at all levels, in particular first- and second-year students looking for research topics. There are no specific formal pre-requisites. Knowledge of important database and operating system concepts, e.g., as covered by an undergraduate course, will be helpful but not essential.

Interested Master's students need to contact the instructor prior to enrolling.

Coursework

Each student presents at least one paper in class.
All students submit very brief summaries of selected papers before these papers are discussed in class.
Students work on a class project, either individually or in small groups. These projects can range from well-defined implementation challenges to more exploratory research projects. Projects that result in publications as research or demo papers are explicitly encouraged.

Project Milestones

Monday, 3/16: Project Proposal due by 5PM
Tuesday, 3/17: 5 min project presentation by all teams in class
4/1- 4/3: review of milestone 1 status
Friday, 4/17: draft of Final Project Report, incl. User Doc and Developer Doc, due by 11PM
Wednesday, 4/22: Final Project Report due by 11PM
Friday, 4/24: 15 min final project presentations by all teams during usual class time

Lectures

Jan 6: Google's MapReduce

Luiz Barroso, Jeffrey Dean, and Urs Hoelzle. Web Search for a Planet: The Google Cluster Architecture
Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004
Additional information

Hadoop open source MapReduce implementation

Jan 9: Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October, 2003

Jan 13: Bigtable

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November, 2006

Jan 16: Sawzall

Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan. Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming Journal. Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure 13:4, pp. 227-298

Jan 20: Pig and PigLatin

Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins: Pig latin: a not-so-foreign language for data processing. SIGMOD Conference 2008: 1099-1110

Pig open source project

Jan 23: From MapReduce to SQL, MapReduce performance study

Ronnie Chaiken, Bob Jenkins, Per-�ke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, Jingren Zhou: SCOPE: easy and efficient parallel processing of massive data sets. PVLDB 1(2): 1265-1276 (2008)
Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, Douglas Stott Parker Jr.: Map-reduce-merge: simplified relational data processing on large clusters. SIGMOD 2007:1029-1040
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary R. Bradski, Christos Kozyrakis: Evaluating MapReduce for Multi-core and Multiprocessor Systems. HPCA 2007:13-24

Jan 27: Dryad and DryadLINQ

SUMMARY: Summarize the DryadLINQ paper

What is the main idea and contribution of the paper? (1-2 paragraphs)
What are the main strengths? (at most 3, 1 paragraph each)
What are the main weaknesses or drawbacks? (at most 3, 1 paragraph each)
How does DryadLINQ compare to (1) MapReduce, (2) Sawzall, and (3) PigLatin? (1 or 2 paragraphs for each comparison, stating only the main commonalities or differences)

Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks, European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008
Additional information

Jan 30: Overview of Cloud computing and the Grid

Brian Hayes. Cloud Computing. Communications of the ACM, Volume 51, Issue 7 (July 2008). Pages 9-11
J. N. Hoover and R. Martin. Demystifying the Cloud. InformationWeek, 2008
I. Foster. The Grid: A New Infrastructure for 21st Century Science. Physics Today, pages 42-47, 2002
H. Stockinger. Defining the grid: a snapshot on the current view. Supercomputing Journal, 42: 3-17, 2007
Additional material:

YouTube videos on Cloud Computing
Grid versus Cloud article
Amazon Elastic Compute Cloud (EC2)
Try "grid computing" and "cloud computing" in Google Trends

Feb 3: Introduction to data mining for eScience

Jim Gray, David T. Liu, Mar�a A. Nieto-Santisteban, Alexander S. Szalay, David J. DeWitt, Gerd Heber: Scientific data management in the coming decade. SIGMOD Record 34(4): 34-41 (2005)
J. Gray, A.S. Szalay, A. Thakar, P. Kunszt, C. Stoughton, D. Slutz, and J. Vandenberg. Data mining the SDSS SkyServer database. In Distributed Data and Structures 4: Records of the 4th International Meeting, pages 189�210, 2002. Carleton Scientific. also as MSR-TR-2002-01
Maria A. Nieto-Santisteban, Alexander S. Szalay, Aniruddha R. Thakar, William J. O'Mullane, Jim Gray, James Annis. When Database Systems Meet The Grid. In Proc. CIDR, 2005
Additional material

Alexander Szalay and Jim Gray. Science in an exponential world. Nature, 440(23), 2006

Feb 6: Tree models

No paper for reading assignment
Brief data mining intro
Decision and regression trees
Tree model variants and ensembles

Feb 10: X-raying complex mining models

W. M. Hochachka, R. Caruana, D. Fink, A. Munson, M. Riedewald, D. Sorokina, and S. Kelling. Data-Mining Discovery of Pattern and Process in Ecological Systems. In Journal of Wildlife Management, 71(7):2427--2437, 2007
Another paper sent by email

Feb 13: Parallel data mining

P1: John C. Shafer, Rakesh Agrawal, Manish Mehta: SPRINT: A Scalable Parallel Classifier for Data Mining. VLDB 1996: 544-555
P2: Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng and Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. In NIPS, 2007

Feb 17, 20: Sequence mining

SUMMARY: Summarize the PrefixSpan paper

What is the main idea and contribution of the paper? (1-2 paragraphs)
What are the main strengths? (at most 3, 1 paragraph each)
What are the main weaknesses or drawbacks? (at most 3, 1 paragraph each)
If you had to find frequent sequence patterns in the trading prices of natural resources (oil, gas, gold etc.), would you use PrefixSpan or one of the algorithms from the Agrawal/Srikant paper? Briefly explain why.

P3: Rakesh Agrawal, Ramakrishnan Srikant: Mining Sequential Patterns. ICDE 1995: 3-14
P4: Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto, Qiming Chen, Umeshwar Dayal, Meichun Hsu: Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. IEEE Trans. Knowl. Data Eng. 16(11): 1424-1440 (2004)
P5: A. Lachmann and M. Riedewald. Finding Relevant Patterns in Bursty Sequences. In Proc. of the VLDB Endowment (PVLDB), 1(1):78-89, 2008

Feb 24, 27, Mar 10: Parallel databases

P6: David DeWitt and Jim Gray. Parallel database systems: The future of high performance database systems. Communications of the ACM, 36(6), 1992
P7: D. DeWitt, et al. The Gamma Database Machine Project, TKDE 2(1), 1990 (download from IEEE Xplore, through NEU library)
P8: Goetz Graefe: Encapsulation of Parallelism in the Volcano Query Processing System. SIGMOD Conference 1990: 102-111
P9: Chaitanya Baru and Gilles Fecteau. An overview of DB2 parallel edition. In SIGMOD, pages 460�462, 1995
P9: David Dewitt and Michael Stonebraker. MapReduce: A major step backwards. The Database Column, January 17, 2008
P9: David Dewitt and Michael Stonebraker. MapReduce II. The Database Column, January 25, 2008

Mar 13: Distributed DBMS fault tolerance

P10: Dale Skeen, Michael Stonebraker: A Formal Model of Crash Recovery in a Distributed System. IEEE Trans. Software Eng. 9(3): 219-228 (1983)
P11: C. Mohan, Bruce G. Lindsay, Ron Obermarck: Transaction Management in the R* Distributed Database Management System. ACM Trans. Database Syst. 11(4): 378-396 (1986)

Mar 17: Consensus in distributed systems

P12: Leslie Lamport, Robert Shostak, and Marshall Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems Volume 4, Issue 3 (July 1982), 382-401
Overview presentations by project teams

Mar 20: Classic DB solutions and new consistency notions

P13: Jim Gray, Pat Helland, Patrick E. O'Neil, Dennis Shasha: The Dangers of Replication and a Solution. SIGMOD Conference 1996: 173-182
P14: Dan Pritchett. BASE: An ACID Alternative. ACM Queue, Vol. 6, No. 3 - May/June 2008
P14: Werner Vogels. Eventually Consistent. ACM Queue, Vol. 6, No. 6 - October 2008
P14: Seth Gilbert, Nancy Lynch. Brewer�s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 2002

Mar 24: Amazon's technology

P16: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels. Dynamo: Amazon�s Highly Available Key-value Store. SOSP�07, October 14�17, 2007, Stevenson, Washington, USA
P17: Matthias Brantner, Daniela Florescu, David A. Graf, Donald Kossmann, Tim Kraska: Building a database on S3. SIGMOD Conference 2008: 251-264
Additional material:

Amazon S3: http://aws.amazon.com/s3/
Werner Vogels' blog

Mar 27: Potpourri (due to re-scheduling)

P15: Tushar Chandra, Robert Griesemer, and Joshua Redstone. Paxos Made Live � An Engineering Perspective. PODC '07: 26th ACM Symposium on Principles of Distributed Computing, 2007
Additional material:

Leslie Lamport. Paxos Made Simple. ACM SIGACT News (Distributed Computing Column) 32, 4, pages 51-58, 2001

P20: Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis, Sinfonia: a new paradigm for building scalable distributed systems, ACM SIGOPS Operating Systems Review, v.41 n.6, December 2007

Mar 31: Yahoo! technology

P18: Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni. PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB Conference (industry track), Auckland, New Zealand, 2008
P19: Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Erik Vee, Raghu Ramakrishnan and Ramana Yerneni. Efficient Bulk Insertion into a Distributed Ordered Table. ACM SIGMOD Conference, Vancouver, BC, Canada, 2008

Apr 3: Clustera and Sinfonia

P21: David J. DeWitt, Erik Paulson, Eric Robinson, Jeffrey F. Naughton, Joshua Royalty, Srinath Shankar, Andrew Krioukov: Clustera: an integrated computation and data management system. PVLDB 1(1): 28-41 (2008)
P22: Ben Vandiver, Hari Balakrishnan, Barbara Liskov, Samuel Madden: Tolerating byzantine faults in transaction processing systems using commit barrier scheduling. SOSP 2007: 59-72

Apr 7: Managing models in a DBMS

P23: Amol Deshpande, Samuel Madden: MauveDB: supporting model-based user views in database systems. SIGMOD Conference 2006: 73-84
P24: Arvind Thiagarajan, Samuel Madden: Querying continuous functions in a database system. SIGMOD Conference 2008: 791-804

Apr 10: Probabilistic DBMS

P25: Nilesh Dalvi, Chris Re, Dan Suciu. Probabilistic Databases: Diamonds in the Dirt. Unpublished manuscript
P26: Douglas Burdick, AnHai Doan, Raghu Ramakrishnan, Shivakumar Vaithyanathan: OLAP over Imprecise Data with Domain Constraints. VLDB 2007: 39-50

Apr 14: Other interesting DB trends

P27: Michael Stonebraker, Samuel Madden, Daniel Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. "The End of an Architectural Era (It's Time for a Complete Rewrite)." In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB), Vienna, Austria, September 2007
P28: Stavros Harizopoulos, Daniel Abadi, Samuel Madden, and Michael Stonebraker. "OLTP Through the Looking Glass, and What We Found There." In Proceedings of the ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, June 2008

Apr 17: Other interesting DB trends (cont.)

P29: Pedro DeRose, Warren Shen, Fei Chen, AnHai Doan, Raghu Ramakrishnan: Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach. VLDB 2007: 399-410
P30: Walker M. White, Alan J. Demers, Christoph Koch, Johannes Gehrke, Rajmohan Rajagopalan: Scaling games to epic proportions. SIGMOD Conference 2007: 31-42