RAPS: Recovery Aware Parallel computing Systems
With rapid advances in processing technology, along with the emerging
multi-core processors and specialized co-processors, parallel processing
is permeating almost every aspect of our lives, from high-end computing to
commodity deployment. Production systems in the next few years are expected
to contain hundreds of thousands of processors, with each processor containing dozens
of cores. Because of their ever-growing scale and complexity,
parallel systems often fail in unpredictable ways. Studies have shown that in production
systems, failure rates range from 20 to more than 1000 per year, and
depending on the root cause of the problem, the system-level mean time to
repair (MTTR) ranges from a couple of hours for failures caused by human
errors to nearly 100 hours for failures due to hardware problems. Years of intense research have focused
on pre-failure prediction and tolerance, that is, predicting failures and taking precautionary actions before they
occur. Despite research progress on failure
prediction, unexpected failures frequently occur in practice, especially in modern parallel systems with
unprecedented size and complexity. Hence, because failures are inevitable, relying on pre-failure prediction and tolerance alone is
insufficient for fault management. Post-failure diagnosis and recovery (i.e., the procedures taken after a
failure) is just as important as avoiding and tolerating failures, and it has a profound impact on almost every aspect of parallel computing.
The goal of this research project is to develop RAPS, a Recovery Aware Parallel computing System for
post-failure diagnosis and recovery. The research focuses on how to quickly and effectively resume
parallel computing after a failure has occurred. The ultimate goal is to seamlessly integrate post-failure
diagnosis and recovery with pre-failure prediction and tolerance as a compound fault management solution
for parallel computing. The approach consists of (1) development of new diagnosis mechanisms for fast
failure detection and root cause analysis, (2) development of system-wide orchestration for recovery
coordination, (3) design of new recovery techniques for quick restoration of parallel applications, and
(4) a comprehensive evaluation. The project also includes three integrated education activities: (1) recruiting and
training graduate and undergraduate students, (2) enhancing the CS curriculum, and (3) providing outreach programs for
underrepresented groups.
Collaborators:
(ANL) Narayan Desai, Daniel Buettner, Rajeev Thakur, Susan Coghlan, Rinku Gupta and Pete Beckman
(Sandia) Jim Brandt and Ann Gentile
(ORNL) Terry Jones, Byung-Hoon Park and Al Geist
Software Tools and Public Data Sets:
(Software) SysDP - an automated fault diagnosis and prognosis software toolkit for large-scale systems. [Link]
(Software) QSim - an event-driven job scheduling simulator for Cobalt. [Link]
(Software) CQSim - an extensible and scalable resource management and job scheduling simulator. [Link]
(Software) schedshow - an analysis and visualization tool for job scheduling. [Link]
(Data set) a 9-month RAS log collected from the 40-rack production Blue Gene/P system has been released and is stored
at the USENIX Computer Failure Data Repository. [Link]
(Data set) a 9-month workload trace collected from the 40-rack production Blue Gene/P system has been released and is
stored at the Parallel Workloads Archive. [Link]
Key Publications:
Z. Zheng, L. Yu, Z. Lan, and T. Jones,
"3-Dimensional Root Cause Diagnosis via Co-Analysis",
Proc. of ICAC, 2012.
L. Yu, Z. Zheng, Z. Lan, T. Jones, J. Brandt, and A. Gentile,
"Filtering Log Data: Finding the Needles in the Haystack",
Proc. of DSN, 2012.
Y. Li and Z. Lan,
"FREM: A Fast Restart Mechanism for General Checkpoint/Restart",
IEEE Trans. on Computers, 60(5), 639-652, 2010.
Z. Lan, Z. Zheng, and Y. Li,
"Toward Automated Anomaly Identification in Large-Scale Systems",
IEEE Trans. on Parallel and Distributed Systems, 21(2), 174-187, 2010.
Z. Zheng, Z. Lan, R. Gupta, S. Coghlan, and P. Beckman,
"A Practical Failure Prediction with Location and Lead Time for Blue Gene/P",
Proc. of the 1st Workshop on Fault-Tolerance for HPC at Extreme Scale
(FTXS), in conjunction with DSN'10, 2010.
Z. Lan, J. Gu, Z. Zheng, R. Thakur, and S. Coghlan,
"A Study of Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems",
Journal of Parallel and Distributed Computing (JPDC), 70(6), 630-643, 2010.
W. Tang, N. Desai, D. Buettner, and Z. Lan,
"Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on Blue Gene/P",
Proc. of IPDPS'10, 2010. [Best Paper Award]
W. Tang, Z. Lan, N. Desai, and D. Buettner,
"Automatic and Coordinated Job Recovery for High Performance Computing",
IEEE Workshop on Many-Task Computing on Grids and Supercomputers, 2010.
Y. Li, Z. Lan, P. Gujrati, and X-H. Sun,
"Fault-Aware Runtime Strategies for High Performance Computing",
IEEE Trans. on Parallel and Distributed Systems, 20(4), 460-473, 2009.
Z. Zheng and Z. Lan,
"Reliability-Aware Scalability Models for High Performance Computing",
Proc. of IEEE Cluster'09, 2009.
Z. Zheng, Z. Lan, B-H. Park, and A. Geist,
"System Log Pre-processing to Improve Failure Prediction",
Proc. of DSN'09, 2009. [PDF]
Z. Zheng, R. Gupta, Z. Lan, and S. Coghlan,
"FTB-enabled Failure Prediction for Blue Gene/P Systems",
Proc. of SC'09 (research poster), 2009.
W. Tang, Z. Lan, N. Desai, and D. Buettner,
"Fault-Aware Utility-Based Job Scheduling on Blue Gene/P Systems",
Proc. of IEEE Cluster'09, 2009.
H. Jin, X. Sun, Z. Zheng, Z. Lan and B. Xie,
"Performance under Failures of DAG-based Parallel Computing",
Proc. of CCGrid'09, 2009. [PDF]
B-H. Park, Z. Zheng, Z. Lan, and A. Geist,
"Analyzing Failure Events on ORNL's Cray XT4",
Proc. of SC'08 (research poster), 2008.
Y. Li and Z. Lan,
"A Fast Recovery Mechanism for Checkpointing in Networked Environments",
Proc. of DSN'08, 2008. [PDF]
Contact:
Dr. Zhiling Lan (lan AT iit DOT edu)
This work is supported by the US National Science Foundation.