ORCHECK Project

Introduction

ORCHECK stands for "ORCHEstrated CHECKpointing". I/O contention is a dominant factor limiting the performance of parallel checkpointing, and ORCHECK proposes a systematic approach to reducing it. The main idea is to orchestrate concurrent checkpoints in an optimized and controllable way so that I/O contention is minimized. ORCHECK targets large-scale parallel computing systems with multi-core architectures and a parallel file system such as PVFS2.

From the perspective of the parallel file system (PFS), ORCHECK uses vertical checkpointing to rearrange the data layout of the checkpoint files, which reduces the number of files serviced by each I/O server and the resulting I/O contention.
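To illustrate why the data layout matters, the following minimal Python sketch counts how many checkpoint files each I/O server has to service under two layouts. It is not ORCHECK or PVFS2 code. The function names, the round-robin placement, and the assumption that a "vertical" layout keeps each file on a single server are all hypothetical simplifications chosen for illustration.

```python
from collections import Counter

def striped_layout(num_files, num_servers):
    """Illustrative baseline: every file is striped across all servers."""
    return {f: list(range(num_servers)) for f in range(num_files)}

def vertical_layout(num_files, num_servers):
    """Assumed 'vertical' placement: each file lives on one server,
    assigned round-robin (a simplification, not the ORCHECK algorithm)."""
    return {f: [f % num_servers] for f in range(num_files)}

def files_per_server(layout):
    """Count how many distinct files each server must service."""
    counts = Counter()
    for servers in layout.values():
        counts.update(servers)
    return dict(counts)

# 16 checkpoint files (one per process) and 4 I/O servers.
striped = files_per_server(striped_layout(16, 4))
vertical = files_per_server(vertical_layout(16, 4))

print("striped :", striped)   # every server services all 16 files
print("vertical:", vertical)  # every server services 4 files
```

Under these assumptions, the per-server file count drops from the number of processes to the number of processes divided by the number of servers, which is the kind of reduction the paragraph above describes.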

From the perspective of the checkpointing middleware, ORCHECK uses a staged checkpointing marshaling technique to serialize the concurrent checkpoints on each compute node, which further improves checkpointing performance.
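The idea of serializing checkpoints on a compute node can be sketched with a per-node gate that admits one writer at a time. The Python example below is an illustration only: the actual prototype lives inside Open MPI, and the class `NodeCheckpointGate`, the `fake_write` function, and the use of threads to stand in for the processes on one node are hypothetical.

```python
import threading
import time

class NodeCheckpointGate:
    """Hypothetical per-node gate: one checkpoint writer at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.active = 0        # writers currently inside the gate
        self.max_active = 0    # highest concurrency ever observed
        self.order = []        # ranks in the order they were admitted

    def write_checkpoint(self, rank, write_fn):
        # All bookkeeping happens while holding the lock, so it is safe.
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            self.order.append(rank)
            write_fn(rank)     # the (simulated) checkpoint write
            self.active -= 1

def fake_write(rank):
    """Stand-in for writing one process's checkpoint image."""
    time.sleep(0.01)

gate = NodeCheckpointGate()
# Four threads stand in for four processes sharing one compute node.
threads = [
    threading.Thread(target=gate.write_checkpoint, args=(r, fake_write))
    for r in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("max concurrent writers:", gate.max_active)
print("admission order:", gate.order)
```

Because every write goes through the gate, at most one checkpoint stream per node reaches the file system at any moment. The order in which ranks are admitted depends on thread scheduling and is not fixed.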

A prototype of ORCHECK is implemented at the system level in Open MPI on top of the PVFS2 file system.

Features

  • Compatible with Open MPI version 1.4 and PVFS2 version 2.8.2.
  • Easy installation and configuration with the patch file.
  • Component independence: vertical checkpointing and staged checkpointing marshaling can be set up individually.

People

    Faculty Advisors

    Xian-He Sun
    Yong Chen

    Graduate Students

    Hui Jin
    Jiayu Ji
    Tao Ke

Publications

  • H. Jin, T. Ke, Y. Chen and X.-H. Sun, "Checkpointing Orchestration: Toward a Scalable HPC Fault-Tolerant Environment", in Proc. of IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2012.
  • H. Jin, "Checkpointing Orchestration for Performance Improvement", in Proc. of the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2010 (student paper).

Download

    Patched Packages

    • The patched packages include the compilable Open MPI/PVFS2 source code and the ORCHECK add-ons.
    • The patched packages are available upon request.


    * ORCHECK is developed as a set of add-ons to Open MPI and PVFS2 and follows all licenses that apply to Open MPI and PVFS2.
    The Open MPI code base is licensed under the new BSD license.
    PVFS2 is released under the GPL/LGPL.

    Acknowledgments

    This research was supported in part by the National Science Foundation under NSF grants CCF-0621435, CCF-0937877, CNS-0834514, CNS-0751200, and CCF-0702737, and by DOE SciDAC-2 (DE-FC02-06ER41442).
    The authors would like to acknowledge Joshua Hursey of the Open MPI group at Indiana University and Samuel Lang of the PVFS2 group at Argonne National Lab for their valuable assistance in the implementation of checkpointing orchestration.
    We are also thankful to Dr. Ioan Raicu of the Illinois Institute of Technology and the MCS department at Argonne National Lab for their support in running large-scale simulations on the SiCortex computing system.
