RCE 84: Scalable Checkpoint/Restart




RCE - Super Computers show

Summary: Multilevel checkpointing lets applications take frequent, inexpensive checkpoints alongside less frequent, more resilient ones, which improves efficiency and reduces load on the parallel file system. The slowest but most resilient level writes to the parallel file system, which can withstand an entire system failure. Faster checkpoint levels, aimed at the most common failure modes, use node-local storage such as RAM, flash, or disk and apply cross-node redundancy schemes. Most failures disable only one or two nodes, and multinode failures often disable nodes in a predictable pattern. With well-chosen redundancy schemes, an application can therefore usually recover from a less resilient checkpoint level. More information: https://computation-rnd.llnl.gov/scr/
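
To make the idea of a cross-node redundancy scheme concrete, here is a minimal sketch of one possible approach: XOR parity computed over the node-local checkpoints of a small group of nodes. The summary does not say which schemes are used, so this is an illustration under stated assumptions, not the SCR implementation. The function names (xor_bytes, make_parity, recover_lost) are hypothetical, checkpoints are modeled as in-memory byte strings, and the parity block is assumed to be stored somewhere that survives the loss of one node.

```python
from functools import reduce
from typing import Sequence


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: Sequence[bytes]) -> bytes:
    """Build one parity block over the checkpoints of a node group.

    Shorter checkpoints are zero-padded to the longest length so that
    every block has the same size before XOR-ing.
    """
    size = max(len(c) for c in checkpoints)
    padded = [c.ljust(size, b"\0") for c in checkpoints]
    return reduce(xor_bytes, padded)


def recover_lost(surviving: Sequence[bytes], parity: bytes, lost_len: int) -> bytes:
    """Rebuild the checkpoint of a single failed node.

    XOR-ing the parity block with every surviving checkpoint cancels
    their contributions and leaves the lost one. The original length
    (assumed to be recorded elsewhere) is used to strip the padding.
    """
    padded = [c.ljust(len(parity), b"\0") for c in surviving]
    return reduce(xor_bytes, padded, parity)[:lost_len]
```

With three nodes holding checkpoints c0, c1, and c2, calling make_parity([c0, c1, c2]) once per checkpoint and then recover_lost([c0, c2], parity, len(c1)) after node 1 fails returns c1 without touching the parallel file system. A scheme like this tolerates only one failure per group; losing two nodes of the same group would require falling back to a more resilient level.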