Enhancement and Support of DMTCP for Adaptive, Extensible Checkpoint-Restart
Fri 02.19.16
Enhancement and Support of DMTCP for Adaptive, Extensible Checkpoint-Restart
Fri 02.19.16
Fri 02.19.16
Fri 02.19.16
Fri 02.19.16
Fri 02.19.16
Society’s increasingly complex cyberinfrastructure creates a concern for software robustness and reliability. Yet, this same complex infrastructure is threatening the continued use of fault tolerance. Consider when a single application or hardware device crashes. Today, in order to resume that application from the point where it crashed, one must also consider the complex subsystem to which it belongs. While in the past, many developers would write application-specific code to support fault tolerance for a single application, this strategy is no longer feasible when restarting the many inter-connected applications of a complex subsystem.
This project will support a plugin architecture for transparent checkpoint-restart. Transparency implies that the software developer does not need to write any application-specific code. The plugin architecture implies that each software developer writes the necessary plugins only once. Each plugin takes responsibility for resuming any interrupted sessions for just one particular component. At a higher level, the checkpoint-restart system employs an ensemble of autonomous plugins operating on all of the applications of a complex subsystem, without any need for application-specific code.
The plugin architecture is part of a more general approach called process virtualization, in which all subsystems external to a process are virtualized. It will be built on top of the DMTCP checkpoint-restart system. One simple example of process virtualization is virtualization of ids. A plugin maintains a virtualization table and arranges for the application code of the process to see only virtual ids, while the outside world sees the real id. Any system calls and library calls using this real id are extended to translate between real and virtual id. On restart, the real ids are updated with the latest value, and the process memory remains unmodified, since it contains only virtual ids. Other techniques employing process virtualization include shadow device drivers, record-replay logs, and protocol virtualization. Some targets of the research include transparent checkpoint-restart support for the InfiniBand network, for programmable GPUs (including shaders), for networks of virtual machines, for big data systems such as Hadoop, and for mobile computing platforms such as Android.
Society’s increasingly complex cyberinfrastructure creates a concern for software robustness and reliability. Yet, this same complex infrastructure is threatening the continued use of fault tolerance. Consider when a single application or hardware device crashes. Today, in order to resume that application from the point where it crashed, one must also consider the complex subsystem to which it belongs. While in the past, many developers would write application-specific code to support fault tolerance for a single application, this strategy is no longer feasible when restarting the many inter-connected applications of a complex subsystem.
This project will support a plugin architecture for transparent checkpoint-restart. Transparency implies that the software developer does not need to write any application-specific code. The plugin architecture implies that each software developer writes the necessary plugins only once. Each plugin takes responsibility for resuming any interrupted sessions for just one particular component. At a higher level, the checkpoint-restart system employs an ensemble of autonomous plugins operating on all of the applications of a complex subsystem, without any need for application-specific code.
The plugin architecture is part of a more general approach called process virtualization, in which all subsystems external to a process are virtualized. It will be built on top of the DMTCP checkpoint-restart system. One simple example of process virtualization is virtualization of ids. A plugin maintains a virtualization table and arranges for the application code of the process to see only virtual ids, while the outside world sees the real id. Any system calls and library calls using this real id are extended to translate between real and virtual id. On restart, the real ids are updated with the latest value, and the process memory remains unmodified, since it contains only virtual ids. Other techniques employing process virtualization include shadow device drivers, record-replay logs, and protocol virtualization. Some targets of the research include transparent checkpoint-restart support for the InfiniBand network, for programmable GPUs (including shaders), for networks of virtual machines, for big data systems such as Hadoop, and for mobile computing platforms such as Android.