Open MPI

Title: Interconnect agnostic checkpoint/restart in Open MPI

Author(s):

Joshua Hursey, Timothy I. Mattox, Andrew Lumsdaine

Abstract:

Long-running High Performance Computing (HPC) applications at scale must be able to tolerate inevitable faults if they are to harness current and future HPC systems. Message Passing Interface (MPI) level transparent checkpoint/restart fault tolerance is an appealing option to HPC application developers who do not wish to restructure their code. Historically, MPI implementations that provided this option have struggled to provide a full range of interconnect support, especially shared memory support. This paper presents a new approach for implementing checkpoint/restart coordination algorithms that allows the MPI implementation of checkpoint/restart to be interconnect agnostic. This approach allows an application to be checkpointed on one set of interconnects (e.g., InfiniBand and shared memory) and be restarted with a different set of interconnects (e.g., Myrinet and shared memory, or Ethernet). By separating the network interconnect details from the checkpoint/restart coordination algorithm, we allow the HPC application to respond to changes in the cluster environment, such as interconnect unavailability due to switch failure, to rebalance its load on an existing machine, or to migrate to a different machine with a different set of interconnects. We present results characterizing the performance impact of this approach on HPC applications.

Presented: Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing (HPDC 2009), June 11-13, 2009, Garching, Germany.

Paper:

hpdc-2009.pdf (PDF)

Bibtex reference:

@inproceedings{1551619,
  author    = {Hursey, Joshua and Mattox, Timothy I. and Lumsdaine, Andrew},
  title     = {Interconnect agnostic checkpoint/restart in Open MPI},
  booktitle = {HPDC '09: Proceedings of the 18th ACM international symposium on High Performance Distributed Computing},
  year      = {2009},
  isbn      = {978-1-60558-587-1},
  pages     = {49--58},
  location  = {Garching, Germany},
  doi       = {http://doi.acm.org/10.1145/1551609.1551619},
  publisher = {ACM},
  address   = {New York, NY, USA},
}