SPEaR: Toward Smart HPC through Active Learning and Intelligent Scheduling
As high performance computing (HPC) continues to grow in scale, energy and resilience
become first-class concerns, in addition to the pursuit of performance. These concerns
demand significant changes in many aspects of the system stack including resource
management and job scheduling. In order to harness the great potential of extreme scale
systems, this project aims to incorporate intelligence into resource management and job
scheduling. More specifically, it will develop a framework named SPEaR (Scheduling for
Performance, Energy, and Resilience efficiency) that dynamically optimizes scheduling
across three dimensions: performance, energy, and resilience. The research focuses on
two thrusts: one is active learning to automatically extract valuable performance, energy,
and resilience patterns and tradeoffs out of application and system data, and the other is
intelligent scheduling to improve and control performance, resilience, and energy
efficiency in resource management and scheduling. An event-driven scheduling simulator is
being developed for comprehensively evaluating scheduling policies and their aggregate
effects. The simulator, along with system logs, will be made available to the broad
community under an open source license.
This project creates critical technologies to promote system productivity and makes
important advances toward smart HPC. Additionally, the learning techniques
developed in this project are applicable to other big data problems of national interest. The
education plan enhances the undergraduate and graduate curricula and broadens
participation of underrepresented groups.
A 1-page poster summary is available: poster-2016.
PI:
Zhiling Lan
Graduate Students:
Yao Kang (Ph.D. student)
Melanie Cornelius (Ph.D. student)
Boyang Li (Ph.D. student)
Xin Wang (Ph.D. student, Summer'19)
Yuping Fan (Ph.D. student, Summer'19)
Peixin Qiao (Ph.D. student, 2017-2019)
Sirisha Cherala (MS student, 2019 Spring)
Li Yu (graduated in 2016, and joined Google Inc.)
Eduardo Berrocal (graduated in 2017, and joined Intel)
Sean Wallace (graduated in 2017, and joined Cray)
Xu Yang (graduated in 2017, and joined Amazon)
Undergraduate Students:
Sergio Servantez (BS, 2018-2019)
Zhen Huang (BS, SC18 Cluster Competition, 2018-2019)
Blake Ehrenbeck (BS, SC18 Cluster Competition, 2018-2019)
Brianna Bransfield (BS, SC18 Cluster Competition)
Collaborators:
Mike Papka (Argonne & NIU)
Susan Coghlan (Argonne)
William Allcock (Argonne)
Franck Cappello (Argonne)
Rob Ross (Argonne)
Sudheer Chunduri (Argonne)
Kevin Harms (Argonne)
Venkat Vishwanath (Argonne)
John Jenkins (Argonne)
Paul Rich (Argonne)
Misbah Mubarak (Argonne)
Sheng Di (Argonne)
Leonardo Bautista-Gomez (Argonne)
Vitali Morozov (Argonne)
Wei Tang (Google)
Narayan Desai (Google)
Key Publications [Link]
Y. Fan, Z. Lan, P. Rich, W. Allcock, M. Papka, B. Austin, and D. Paul,
"Scheduling Beyond CPUs for HPC",
Proc. of HPDC'19, 2019.
Y. Kang, X. Wang, N. McGlohon, M. Mubarak, S. Chunduri, Z. Lan,
"Modeling and Analysis of Application Interference on Dragonfly+",
Proc. of SIGSIM PADS'19, 2019.
B. Li, S. Chunduri, K. Harms, Y. Fan, and Z. Lan,
"The Effect of System Utilization on Application Performance Variability",
Proc. of ROSS'19, 2019.
S. Servantez, R. Zamora, F. Tessier, Z. Lan,
"Using RAM Area Network to Reduce Synchronization Costs in Collective I/O Operations",
Research Poster, SIAM Conference on Computational Science and Engineering (CSE19), 2019.
[poster]
M. Cornelius, W. Allcock, B. Toonen, Z. Lan, Z. Cornelius,
"Machine Learning on a RAM Area Network",
Research Poster, IEEE/ACM SC'18, 2018.
[poster]
L. Yu, Z. Zhou, Y. Fan, M. Papka, and Z. Lan,
"System-wide Tradeoff Modeling of Performance, Power, and Resilience on Petascale Systems",
Journal of Supercomputing, 2018.
E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan, and F. Cappello,
"Toward General Software Level Silent Data Corruption Detection for Parallel Applications",
IEEE Trans. on Parallel and Distributed Systems (TPDS), 2017.
P. Qiao, X. Wang, X. Yang, Y. Fan, and Z. Lan,
"Joint Effects of Application Communication Pattern, Job Placement, and Network Routing on Fat-Tree Systems",
Proc. of ICPP-W, 2018.
Y. Fan, P. Rich, W. Allcock, M. Papka, and Z. Lan,
"Trade-off Between Prediction Accuracy and Underestimation Rate in Job Runtime Estimates",
Proc. of IEEE Cluster'17 (acceptance rate is 21.8%),
2017.
W. Allcock, P. Rich, Y. Fan, and Z. Lan,
"Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne",
Proc. of the 21st workshop on Job Scheduling Strategies for
Parallel Processing (JSSPP), 2017.
X. Yang, J. Jenkins, M. Mubarak,
R. Ross, and Z. Lan, "Watch Out for the Bully! Job Interference Study on Dragonfly Network",
Proc. of SC16 (acceptance rate is 18%),
2016.
S. Wallace, X. Yang, V. Vishwanath,
W. Allcock, S. Coghlan, M. Papka, and Z. Lan, "A Data Driven Scheduling Approach for Power Management
on HPC Systems", Proc. of SC16 (acceptance rate is 18%),
2016.
Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang,
V. Morozov, and N. Desai, "Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus
Network Allocation Constraints", IEEE Transactions on Parallel
and Distributed Systems,
2016.
Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang,
J. Wang, and Z. Lan, "I/O Aware Job Scheduling and Bandwidth Allocation for Petascale Computing
Systems", Journal of Parallel Computing (ParCo),
2016.
S. Wallace, Z. Zhou, V. Vishwanath, S. Coghlan,
J. Tramm, Z. Lan, and M.E. Papka, "Application Power Profiling on IBM Blue Gene/Q",
Journal of Parallel Computing (ParCo),
2016.
E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan,
and F. Cappello, "Exploring Partial Replication to Improve Lightweight Silent Data
Corruption Detection for HPC Applications", Proc. of Euro-Par,
2016.
L. Yu, Z. Zhou, S. Wallace,
M.E. Papka, and Z. Lan, "Quantitative Modeling of Power Performance Tradeoffs on Extreme
Scale Systems", Journal of Parallel and Distributed Computing,
2015.
L. Yu and Z. Lan, "A Scalable, Non-Parametric
Anomaly Detection Method for Large Scale Computing", IEEE Transactions
on Parallel and Distributed Systems, vol. 99(7), pp. 1902-1914,
2015.
E. Berrocal, L. Bautista-Gomez, S. Di, Z. Lan,
and F. Cappello, "Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis
for HPC Applications" (short paper), Proc. of HPDC'15, 2015.
E. Berrocal, L. Yu, S. Wallace, M. Papka, and Z. Lan,
"Exploring Void Search for Fault Detection on Extreme Scale Systems" (Best Paper Award),
Proc. of IEEE Cluster'14, 2014.
Z. Zheng, L. Yu, and Z. Lan,
"Reliability-Aware Speedup Models for Parallel Applications with Coordinated Checkpointing/Restart",
IEEE Trans. on Computers, 2014.
Software Tools and Data:
(Software) CQSim - a trace-based, event-driven scheduling simulator. [Link]
(Software) CODES - a flit-level, event-driven simulation toolkit for distributed system simulations. [Link]
(Software) PuPPET - a Petri net based modeling tool for quantitative power and performance
analysis [Link]
(Software) TOPPER - a Petri net based modeling tool for quantitative analysis of performance,
power, and resilience [Link]
(Data) Workload traces at ALCF [Link].
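To illustrate what "trace-based, event-driven scheduling simulation" means in practice, here is a minimal Python sketch. It is not CQSim code and does not reflect its API or design; the names `Job` and `simulate_fcfs`, the trace fields, and the strict first-come-first-served policy are all assumptions chosen for illustration. The simulator replays job submissions from a trace, jumps from one event (submit or finish) to the next, and records when each job starts.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    """One trace record (hypothetical fields, not a real trace format)."""
    job_id: int
    submit: float   # submission time in seconds
    runtime: float  # actual runtime in seconds
    nodes: int      # number of nodes requested


FINISH, SUBMIT = 0, 1  # finishes sort first when two events share a timestamp


def simulate_fcfs(jobs, total_nodes):
    """Replay a job trace under strict FCFS and return {job_id: start_time}."""
    for job in jobs:
        if job.nodes > total_nodes:
            raise ValueError(f"job {job.job_id} can never fit on the machine")

    # Event heap entries are (time, kind, seq, job); the unique seq breaks
    # ties so that Job objects are never compared with each other.
    events = []
    for seq, job in enumerate(sorted(jobs, key=lambda j: j.submit)):
        heapq.heappush(events, (job.submit, SUBMIT, seq, job))
    seq = len(jobs)

    free = total_nodes  # idle nodes at the current simulated time
    queue = []          # waiting jobs in arrival order
    starts = {}

    while events:
        now, kind, _, job = heapq.heappop(events)
        if kind == SUBMIT:
            queue.append(job)
        else:
            free += job.nodes  # a finished job releases its nodes

        # Strict FCFS: only the job at the head of the queue may start.
        while queue and queue[0].nodes <= free:
            head = queue.pop(0)
            free -= head.nodes
            starts[head.job_id] = now
            heapq.heappush(events, (now + head.runtime, FINISH, seq, head))
            seq += 1

    return starts


trace = [
    Job(job_id=1, submit=0.0, runtime=10.0, nodes=4),
    Job(job_id=2, submit=1.0, runtime=5.0, nodes=2),
    Job(job_id=3, submit=2.0, runtime=5.0, nodes=2),
]
print(simulate_fcfs(trace, total_nodes=4))  # {1: 0.0, 2: 10.0, 3: 10.0}
```

In this toy trace, job 1 occupies the whole 4-node machine, so jobs 2 and 3 wait until it finishes at t = 10 and then start together. A policy study would swap the inner FCFS loop for another rule (for example, backfilling) and compare the resulting wait times on the same trace.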
Contact:
Dr. Zhiling Lan (lan AT iit DOT edu)
Acknowledgement:
This project is supported by the US National Science Foundation (CCF-1422009).
Note: Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the
National Science Foundation.