Enhancement and Support of DMTCP for Adaptive, Extensible Checkpoint-Restart
Lead PI
Abstract
Society’s increasingly complex cyberinfrastructure creates a concern for software robustness and reliability. Yet, this same complex infrastructure is threatening the continued use of fault tolerance. Consider when a single application or hardware device crashes. Today, in order to resume that application from the point where it crashed, one must also consider the complex subsystem to which it belongs. While in the past, many developers would write application-specific code to support fault tolerance for a single application, this strategy is no longer feasible when restarting the many inter-connected applications of a complex subsystem.
This project will support a plugin architecture for transparent checkpoint-restart. Transparency implies that the software developer does not need to write any application-specific code. The plugin architecture implies that each software developer writes the necessary plugins only once. Each plugin takes responsibility for resuming any interrupted sessions for just one particular component. At a higher level, the checkpoint-restart system employs an ensemble of autonomous plugins operating on all of the applications of a complex subsystem, without any need for application-specific code.
The plugin architecture is part of a more general approach called process virtualization, in which all subsystems external to a process are virtualized. It will be built on top of the DMTCP checkpoint-restart system. One simple example of process virtualization is virtualization of ids. A plugin maintains a virtualization table and arranges for the application code of the process to see only virtual ids, while the outside world sees the real id. Any system calls and library calls using this real id are extended to translate between real and virtual id. On restart, the real ids are updated with the latest value, and the process memory remains unmodified, since it contains only virtual ids. Other techniques employing process virtualization include shadow device drivers, record-replay logs, and protocol virtualization. Some targets of the research include transparent checkpoint-restart support for the InfiniBand network, for programmable GPUs (including shaders), for networks of virtual machines, for big data systems such as Hadoop, and for mobile computing platforms such as Android.
Funding
Related Publications
- Kapil Arya and Gene Cooperman. “DMTCP: bringing interactive checkpoint–restart to Python,” Computational Science & Discovery, v.8, 2015, p. 16 pages. DOI: 10.1088/issn.1749-4699
- Jiajun Cao, Matthieu Simoni, Gene Cooperman, and Christine Morin. “Checkpointing as a Service in Heterogeneous Cloud Environments,” Proc. of 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid’15), 2015, p. 61–70. DOI: 10.1109/CCGrid.2015.160