Language and Runtime Support for Large-Scale Data Analytics
Wed 12.23.15
As the cost of computing and communication resources has plummeted, applications have become data-centric, with data products growing explosively in both number and size. Although the compute power needed to analyze and process such data is cheap and readily available via cloud computing (intuitive, utility-style access to vast resource pools), using it currently requires significant expertise, experience, and time (for customization, configuration, deployment, etc.).
This work investigates new models of cloud computing that combine domain-targeted languages with scalable data processing, sharing, and management abstractions within a distributed service platform that “scales” programmer productivity. To that end, the research explores new programming language, runtime, and distributed systems techniques and technologies that integrate the R programming language environment with open source cloud platform-as-a-service (PaaS) in ways that simplify processing massive datasets, sharing datasets across applications and users, and tracking and enforcing data provenance. The PIs’ plans for research, outreach, integrated curricula, and open source release of research artifacts have the potential to make cloud computing accessible to a much wider range of users, in particular the data analytics community, who use the R statistical analysis environment to apply their techniques and algorithms to important problems in areas such as biology, chemistry, physics, political science, and finance. The work would enable these users to use cloud resources transparently for their analyses and to share their scientific data and results in a way that lets others reproduce and verify them.