Gene Cooperman
Research Interests
- Fault tolerance and transparent checkpointing
- Supercomputing, parallel computing, cloud computing
- Formal verification
- Cybersecurity
Education
- PhD in Applied Mathematics, Brown University
- BS in Mathematics and Physics, University of Michigan
Pronouns
he/him/his
Biography
Gene Cooperman is a professor at the Khoury College of Computer Sciences and an affiliated faculty member at the College of Engineering at Northeastern University. He has worked in a series of interdisciplinary research areas, including applied mathematics, computational and symbolic algebra, numerical analysis, computing in high energy physics, bioinformatics, high-performance computing, and computer systems.
Before joining Northeastern, he was a principal MTS at GTE Laboratories from 1980 to 1986. Cooperman received his bachelor’s degree from the University of Michigan in 1974 and his doctorate from Brown University in 1978. He has also held a five-year IDEX Chair of Attractivity position at the University of Toulouse in France, as well as visiting research positions at Concordia University, CERN, and Inria. As a result of his work at CERN, he joined the Geant4 Collaboration and contributed to the foundational paper “GEANT4 – A Simulation Toolkit,” which currently has approximately 25,000 citations and is the most widely cited paper in high energy physics.
Cooperman leads the High Performance Computing Laboratory at Northeastern University, and he currently co-leads an Inria associate team in a three-year project called “FogRein: Steering Efficiency for Distributed Applications.” He has co-authored more than 100 refereed publications, advised doctoral students, and personally led several open-source software projects:
- Earlier projects: TOP-C (Task-Oriented Parallel C/C++); Roomy (middleware for big data, used to show that 26 moves suffice for Rubik’s Cube); and ParGeant4 (distributed parallelism for CERN-based Geant4 software for Monte Carlo particle-matter interaction in high energy physics)
- Geant4-MT: Geant4-Multi-Threaded, a 5-year project to retroactively introduce multi-threading to the Geant4 production software (see the previous bullet for the goals of Geant4)
- DMTCP: Distributed MultiThreaded Checkpointing, a 15-year ongoing project providing transparent checkpoint-restart for most Linux applications, including MPI, CUDA for GPUs, most HPC interconnect networks, and other heterogeneous environments
The Geant4-MT project (Geant4 Multithreaded) culminated in January 2014 with the incorporation of Geant4-MT into the Geant4 version 10.0 release, after extensive validation testing by the Geant4 collaboration. It is now maintained directly by the Geant4 consortium. In the 15 years prior to Geant4-MT, Geant4 had grown purely as a single-threaded package of almost a million lines of code. Retroactively adding multi-threading to the Geant4 production software was a major undertaking that resulted in refereed publications and a full description in the doctoral thesis of Xin Dong.
The ongoing DMTCP project (Distributed MultiThreaded Checkpointing) supports transparent checkpointing (snapshots) with no modification to the target application binary. DMTCP extends transparent checkpoint support to external hardware/software environments such as GPUs and network interconnects, in order to support MPI for HPC. The project's roots date to 2004, and DMTCP now incorporates results from a series of doctoral theses and other student work. It has been used by independent researchers in more than 150 refereed research publications. Application domains using DMTCP include circuit verification, formal verification, CPU chip design by Intel and others, VLSI circuit simulators, formalization of mathematics, bioinformatics, network simulation, high energy physics, cybersecurity, big data, middleware, mobile computing, cloud computing, virtualization of GPUs, and high-performance computing (HPC).
The newest direction for DMTCP is to make it a standard for supercomputing and HPC. In collaboration with the DOE’s NERSC supercomputing center, the DMTCP project (including MANA for MPI and CRAC for CUDA) is being extended and validated for production use. The software will be used on NERSC’s Perlmutter supercomputer (expected to become the #6 supercomputer in the world when fully installed). The functionality provided by DMTCP, MANA, and CRAC will enable scientists to execute long-running computations by using checkpoint-restart to chain together multiple allocation time slots. Currently, users are limited to a maximum allocation time slot of 48 hours. This showcase project will allow other HPC centers to use this technology as well.