CSG 339: Google, eScience, and the Cloud--Scalable Techniques for Massive
Data
The "How much Information?" project at UC Berkeley in 2003 estimated that stored information worldwide grew by about 30% a year between 1999 and
2002. Internet search companies, large online retailers, and social networking
sites are looking for new approaches for managing their large data collections.
At the same time, data-intensive science is emerging as a new paradigm that is
concerned with collecting, archiving, and analyzing the vast amounts of data
being produced and accumulated by modern science. This includes gigabytes, even petabytes, of data
generated by scientific instruments and sensors, human observers, and computer
simulations. Turning scientific raw
data into knowledge will be the key for future scientific discoveries.
To turn raw data into knowledge and to make it broadly available, requires
innovative approaches to scalable data management and data mining. One major
difference compared to previous decades is the end of the era of exponentially
increasing processor speed. Hence the current trend is towards multi-core and cluster
architectures as the main platforms for scalable data processing.
In this course we will discuss recent papers that introduce innovative
solutions coming from both industry and academia. Where appropriate, classic
papers and landmark results will be presented to provide the necessary background for the
latest research results.
Course Information
Instructor: Mirek
Riedewald
Meeting times: Tue, Fri 11:45 - 1:25
Requirements
This course is ideally suited for Ph.D. students at all levels,
in particular first- and second-year students looking for research topics. There are no specific formal pre-requisites. Knowledge of important database
and operating system concepts, e.g., as covered by an undergraduate course, will
be helpful but not essential.
Interested Master's students need to contact the instructor prior to
enrolling.
Coursework
- Each student presents at least one paper in class.
- All students submit very brief summaries of selected papers before
these papers are discussed in class.
- Students work on a class project, either individually or in small
groups. These projects can range from well-defined implementation challenges
to more exploratory research projects. Projects that result in publications
as research or demo papers are explicitly encouraged.
Project Milestones
- Monday, 3/16: Project Proposal due by 5PM
- Tuesday, 3/17: 5 min project presentation by all teams in class
- 4/1- 4/3: review of milestone 1 status
- Friday, 4/17: draft of Final Project Report, incl. User Doc and
Developer Doc, due by 11PM
- Wednesday, 4/22: Final Project Report due by 11PM
- Friday, 4/24: 15 min final project presentations by all teams during
usual class time
Lectures
Jan 6: Google's MapReduce
Jan 9: Google File System
- Sanjay Ghemawat,
Howard Gobioff, and
Shun-Tak Leung.
The Google File System.
19th ACM Symposium on Operating Systems Principles, Lake George, NY,
October, 2003
Jan 13: Bigtable
- Fay Chang, Jeffrey Dean,
Sanjay Ghemawat, Wilson
C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes,
and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data.
OSDI'06: Seventh Symposium on Operating System Design and Implementation,
Seattle, WA, November, 2006
Jan 16: Sawzall
Jan 20: Pig and PigLatin
Jan 23: From MapReduce to SQL, MapReduce performance study
- Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib,
Simon Weaver, Jingren Zhou:
SCOPE: easy and efficient parallel processing of massive data sets.
PVLDB 1(2): 1265-1276 (2008)
- Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, Douglas Stott Parker Jr.:
Map-reduce-merge: simplified relational data processing on large clusters.
SIGMOD 2007:1029-1040
- Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary R. Bradski,
Christos Kozyrakis:
Evaluating MapReduce for Multi-core and Multiprocessor Systems. HPCA
2007:13-24
Jan 27: Dryad and DryadLINQ
- SUMMARY: Summarize the DryadLINQ paper
- What is the main idea and contribution of the paper? (1-2 paragraphs)
- What are the main strengths? (at most 3, 1 paragraph each)
- What are the main weaknesses or drawbacks? (at most 3, 1 paragraph each)
- How does DryadLINQ compare to (1) MapReduce, (2) Sawzall, and (3)
PigLatin? (1 or 2 paragraphs for each comparison, stating only the main
commonalities or differences)
- Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly.
Dryad: distributed data-parallel programs from sequential building blocks,
European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March
21-23, 2007
- Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson,
Pradeep Kumar Gunda, and Jon Currey.
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing
Using a High-Level Language. Symposium on Operating System Design and
Implementation (OSDI), San Diego, CA, December 8-10, 2008
- Additional information
Jan 30: Overview of Cloud computing and the Grid
Feb 3: Introduction to data mining for eScience
- Jim Gray, David T. Liu, María A. Nieto-Santisteban, Alexander S. Szalay,
David J. DeWitt, Gerd Heber:
Scientific data management in the coming decade. SIGMOD Record 34(4):
34-41 (2005)
- J. Gray, A.S. Szalay, A. Thakar, P. Kunszt, C. Stoughton, D. Slutz, and
J. Vandenberg.
Data mining the SDSS SkyServer database. In Distributed Data
and Structures 4: Records of the 4th International Meeting, pages 189–210, 2002. Carleton Scientific. also as MSR-TR-2002-01
- Maria A. Nieto-Santisteban, Alexander S. Szalay, Aniruddha R. Thakar,
William J. O'Mullane, Jim Gray, James Annis.
When
Database Systems Meet The Grid. In Proc. CIDR, 2005
- Additional material
Feb 6: Tree models
- No paper for reading assignment
- Brief data mining intro
- Decision and regression trees
- Tree model variants and ensembles
Feb 10: X-raying complex mining models
Feb 13: Parallel data mining
Feb 17, 20: Sequence mining
- SUMMARY: Summarize the PrefixSpan paper
- What is the main idea and contribution of the paper? (1-2 paragraphs)
- What are the main strengths? (at most 3, 1 paragraph each)
- What are the main weaknesses or drawbacks? (at most 3, 1 paragraph each)
- If you had to find frequent sequence patterns in the trading prices of
natural resources (oil, gas, gold etc.), would you use PrefixSpan or one of
the algorithms from the Agrawal/Srikant paper? Briefly explain why.
- P3: Rakesh Agrawal, Ramakrishnan Srikant:
Mining Sequential Patterns. ICDE 1995: 3-14
- P4: Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto,
Qiming Chen, Umeshwar Dayal, Meichun Hsu:
Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach.
IEEE Trans. Knowl. Data Eng. 16(11): 1424-1440 (2004)
- P5: A. Lachmann and M. Riedewald.
Finding Relevant Patterns in Bursty Sequences. In
Proc. of the VLDB
Endowment (PVLDB), 1(1):78-89, 2008
Feb 24, 27, Mar 10: Parallel databases
- P6: David DeWitt and Jim Gray.
Parallel
database systems: The future of high performance database systems.
Communications of the ACM, 36(6), 1992
- P7: D. DeWitt, et al. The
Gamma Database Machine Project, TKDE 2(1), 1990 (download from IEEE Xplore,
through NEU library)
-
P8: Goetz Graefe:
Encapsulation of Parallelism in the Volcano Query Processing System.
SIGMOD Conference 1990: 102-111
- P9: Chaitanya Baru and Gilles Fecteau.
An overview of DB2 parallel edition.
In SIGMOD, pages 460–462, 1995
- P9: David Dewitt and Michael Stonebraker.
MapReduce: A major step backwards. The Database Column, January 17, 2008
- P9: David Dewitt and Michael Stonebraker.
MapReduce II. The Database Column, January 25, 2008
Mar 13: Distributed DBMS fault tolerance
Mar 17: Consensus in distributed systems
- P12: Leslie Lamport, Robert Shostak, and Marshall Pease.
The Byzantine generals problem. ACM Transactions on Programming
Languages and Systems Volume 4, Issue 3 (July 1982), 382-401
- Overview presentations by project teams
Mar 20: Classic DB solutions and new consistency notions
- P13: Jim Gray, Pat Helland, Patrick E. O'Neil, Dennis Shasha:
The Dangers of Replication and a Solution. SIGMOD Conference 1996:
173-182
- P14: Dan Pritchett.
BASE: An ACID Alternative. ACM Queue, Vol. 6, No. 3
- May/June 2008
- P14: Werner Vogels.
Eventually Consistent.
ACM Queue, Vol. 6, No. 6 - October 2008
- P14: Seth Gilbert, Nancy Lynch.
Brewer’s conjecture and the feasibility of consistent, available,
partition-tolerant web services. ACM SIGACT News, 2002
Mar 24: Amazon's technology
- P16: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan
Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian,
Peter Vosshall and Werner Vogels.
Dynamo: Amazon’s Highly Available Key-value Store. SOSP’07, October
14–17, 2007, Stevenson, Washington, USA
- P17: Matthias Brantner, Daniela Florescu, David A. Graf, Donald Kossmann,
Tim Kraska:
Building a
database on S3. SIGMOD Conference 2008: 251-264
- Additional material:
Mar 27: Potpourri (due to re-scheduling)
- P15: Tushar Chandra, Robert Griesemer, and Joshua Redstone.
Paxos Made Live
– An Engineering Perspective. PODC '07: 26th ACM Symposium on Principles
of Distributed Computing, 2007
- Additional material:
- Leslie Lamport.
Paxos Made Simple. ACM SIGACT News (Distributed Computing Column)
32, 4, pages 51-58, 2001
- P20: Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch,
Christos Karamanolis,
Sinfonia: a
new paradigm for building scalable distributed systems, ACM SIGOPS
Operating Systems Review, v.41 n.6, December 2007
Mar 31: Yahoo! technology
- P18: Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam
Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver
and Ramana Yerneni.
PNUTS: Yahoo!'s Hosted Data Serving Platform. VLDB Conference (industry
track), Auckland, New Zealand, 2008
- P19: Adam Silberstein, Brian F. Cooper, Utkarsh Srivastava, Erik Vee,
Raghu Ramakrishnan and Ramana Yerneni.
Efficient Bulk
Insertion into a Distributed Ordered Table. ACM SIGMOD Conference,
Vancouver, BC, Canada, 2008
Apr 3: Clustera and Sinfonia
Apr 7: Managing models in a DBMS
Apr 10: Probabilistic DBMS
Apr 14: Other interesting DB trends
- P27: Michael Stonebraker, Samuel Madden, Daniel Abadi, Stavros
Harizopoulos, Nabil Hachem, and Pat Helland. "The
End of an Architectural Era (It's Time for a Complete Rewrite)." In
Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB),
Vienna, Austria, September 2007
- P28: Stavros Harizopoulos, Daniel Abadi, Samuel Madden, and Michael
Stonebraker. "OLTP
Through the Looking Glass, and What We Found There." In Proceedings
of the ACM SIGMOD International Conference on Management of Data,
Vancouver, BC, Canada, June 2008
Apr 17: Other interesting DB trends (cont.)
- P29: Pedro DeRose, Warren Shen, Fei Chen, AnHai Doan, Raghu Ramakrishnan:
Building Structured
Web Community Portals: A Top-Down, Compositional, and Incremental Approach.
VLDB 2007: 399-410
- P30: Walker M. White, Alan J. Demers, Christoph Koch, Johannes Gehrke,
Rajmohan Rajagopalan:
Scaling
games to epic proportions. SIGMOD Conference 2007: 31-42