Language and Runtime Support for Large-Scale Data Analytics
Lead PI
Abstract
As the cost of computing and communication resources has plummeted, applications have become data-centric, with data products growing explosively in both number and size. The compute power needed to analyze and process such data is cheap and readily available via cloud computing (intuitive, utility-style access to vast resource pools), but using it currently requires significant expertise, experience, and time (for customization, configuration, deployment, etc.).
This work investigates new models of cloud computing that combine domain-targeted languages with scalable data processing, sharing, and management abstractions within a distributed service platform that “scales” programmer productivity. To that end, the research explores new programming language, runtime, and distributed systems techniques and technologies that integrate the R programming language environment with open source cloud platform-as-a-service (PaaS) in ways that simplify processing massive datasets, sharing datasets across applications and users, and tracking and enforcing data provenance. The PIs’ plans for research, outreach, integrated curricula, and open source release of research artifacts have the potential to make cloud computing accessible to a much wider range of users, in particular the data analytics community, which uses the R statistical analysis environment to apply its techniques and algorithms to important problems in areas such as biology, chemistry, physics, political science, and finance. The work aims to let these users draw on cloud resources transparently for their analyses and share their scientific data and results in a way that enables others to reproduce and verify them.
Funding
Related Publications
- DeVito, Hegarty, Aiken, Hanrahan, Vitek. “Terra: a multi-stage language for high-performance computing.” PLDI ’13 Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013, p.105. DOI: 10.1145/2491956.2462166
- Kalibera, Maj, Morandat, Vitek. “A Fast Abstract Syntax Tree Interpreter for R.” VEE ’14 Proceedings of the 10th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, 2014, p.89. DOI: 10.1145/2576195.2576205
- Kalibera, Mole, Jones, Vitek. “A black-box approach to understanding concurrency in DaCapo.” OOPSLA ’12 Proceedings of the ACM international conference on Object oriented programming systems languages and applications, 2012, p.335. DOI: 10.1145/2384616.2384641
- Meawad, Richards, Morandat, Vitek. “Eval begone!: semi-automated removal of eval from JavaScript programs.” OOPSLA ’12 Proceedings of the ACM international conference on Object oriented programming systems languages and applications, 2012. DOI: 10.1145/2384616.2384660
- Morandat, Hill, Osvald, Vitek. “Evaluating the Design of the R Language.” European Conference on Object-Oriented Programming, 2012, p.104. DOI: 10.1007/978-3-642-31057-7_6
- Terei, Aiken, Vitek. “M3: high-performance memory management from off-the-shelf components.” ISMM ’14 Proceedings of the 2014 international symposium on Memory management, 2014, p.3. DOI: 10.1145/2602988.2602995