IRON Project

Intelligent Management of Hybrid Workloads for Extreme Scale Computing

The high-performance computing (HPC) community is embracing artificial intelligence (AI) techniques for countless pursuits, from driving ground-breaking scientific discoveries to protecting our national security. As newly emerging machine learning and data-centric workloads proliferate in HPC, current workload-management systems cannot keep up with the significant challenges introduced by the diverse mix of applications co-running on heterogeneous systems. This project tackles the problem by developing an intelligent workload-management framework named MINT (Multi-resource INtelligenT management), in which the distinctive computational resource requirements of hybrid workloads will be automatically identified and fulfilled to achieve extreme resource efficiency and a satisfactory user experience. The project will develop fundamental improvements in HPC workload management to promote the use of large-scale supercomputers for emerging data-centric applications (HPC4AI). Meanwhile, it will exploit advanced AI technologies, especially multi-objective reinforcement learning, to empower job scheduling and resource allocation in HPC (AI4HPC). Key research thrusts include understanding the performance implications of diverse workloads on supercomputers via model-driven analysis, new intelligent multi-resource scheduling methods, smart resource-allocation strategies for minimal workload interference, and extensive evaluation of the proposed framework through trace-based simulation and testing.

Faculty:

Zhiling Lan (PI)

Kai Shu (co-PI)

Graduate Students:

Yuping Fan (PhD, graduated in 12/2021)

Boyang Li (PhD)

Matt Dearing (PhD)

Xiongxiao Xu (PhD)

Prashant Ravi (PhD)

Zhong Zheng (PhD)

Collaborators:

Mike Papka (ANL)

Bill Allcock (ANL)

Paul Rich (ANL)

Key Publications

Y. Kang, X. Wang, and Z. Lan, "Mitigating Network Contention with Intelligent Routing", ACM/IEEE SC, 2022.

B. Li, M. Dearing, Y. Fan, P. Rich, B. Allcock, M. Papka, and Z. Lan, "MRSch: Multi-Resource Scheduling for HPC", IEEE Cluster, 2022.

B. Li, Y. Fan, M. Papka, and Z. Lan, "Encoding for Reinforcement Learning Driven Scheduling", JSSPP, co-located with IPDPS, 2022.

Y. Fan, B. Li, D. Favorite, N. Singh, T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan, "DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing", IEEE TPDS (under revision), 2022.

Y. Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, and M. Papka, "Deep Reinforcement Agent for Scheduling in HPC", IPDPS, 2021. [PDF]

Y. Fan and Z. Lan, "DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling", Software Impacts, 2021. [PDF]

Software Tools and Data:

(Software) DRAS/CQSim - a discrete-event-driven scheduling simulator empowered by reinforcement learning. It is available on the team's GitHub. [Link]

(Software) CQGym - a common platform for studying various cluster scheduling policies under the same setting. In CQGym, a discrete-event-driven scheduling environment is integrated with a scheduling agent, such as a deep reinforcement learning agent, through the OpenAI Gym interface. It is available on the team's GitHub. [Link]
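
To illustrate how a discrete-event-driven scheduling environment can be exposed to an agent through the classic Gym-style reset/step convention, here is a minimal sketch in Python. It is not the CQGym or CQSim code: the names Job, ToySchedulingEnv, and run_episode, the observation layout, and the reward (negative wait time of the started job) are all assumptions made for illustration, and the toy environment does no backfilling.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    submit: int    # submission time
    job_id: int
    nodes: int     # number of nodes requested
    runtime: int   # run time once the job starts


class ToySchedulingEnv:
    """Hypothetical Gym-style scheduling environment (NOT the CQGym API)."""

    def __init__(self, jobs, total_nodes):
        if any(j.nodes > total_nodes for j in jobs):
            raise ValueError("a job requests more nodes than the system has")
        self.jobs = sorted(jobs, key=lambda j: j.submit)
        self.total_nodes = total_nodes

    def reset(self):
        self.clock = 0
        self.free = self.total_nodes
        self.pending = list(self.jobs)  # jobs not yet submitted
        self.queue = []                 # submitted jobs waiting to start
        self.running = []               # min-heap of (end_time, nodes)
        self._advance()
        return self._obs()

    def step(self, action):
        """Start the queued job at index `action`, waiting if it does not fit."""
        job = self.queue.pop(action)
        while job.nodes > self.free:
            # Jump to the next job-completion event and reclaim its nodes.
            self.clock = self.running[0][0]
            self._release()
        self._admit()  # jobs that arrived while we were waiting
        wait = self.clock - job.submit
        self.free -= job.nodes
        heapq.heappush(self.running, (self.clock + job.runtime, job.nodes))
        self._advance()
        done = not self.queue and not self.pending
        return self._obs(), -wait, done, {"started": job.job_id}

    def _admit(self):
        while self.pending and self.pending[0].submit <= self.clock:
            self.queue.append(self.pending.pop(0))

    def _release(self):
        while self.running and self.running[0][0] <= self.clock:
            _, nodes = heapq.heappop(self.running)
            self.free += nodes

    def _advance(self):
        # If nothing is waiting, jump to the next job-arrival event.
        while not self.queue and self.pending:
            self.clock = max(self.clock, self.pending[0].submit)
            self._release()
            self._admit()

    def _obs(self):
        waiting = [(j.job_id, j.nodes, j.runtime, self.clock - j.submit)
                   for j in self.queue]
        return {"clock": self.clock, "free_nodes": self.free, "queue": waiting}


def run_episode(env, policy):
    """Run one episode; `policy` maps an observation to a queue index."""
    obs = env.reset()
    total, done = 0, not obs["queue"]
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total


if __name__ == "__main__":
    demo_jobs = [Job(0, 0, 4, 10), Job(0, 1, 2, 5), Job(1, 2, 2, 3)]
    env = ToySchedulingEnv(demo_jobs, total_nodes=4)
    # First-come-first-served baseline: always start the head of the queue.
    print("FCFS total reward:", run_episode(env, lambda obs: 0))
```

Because the policy is just a function from observation to action, a rule-based scheduler and a learned agent can be compared on the same job trace by swapping the argument passed to run_episode, which is the kind of like-for-like comparison the description above refers to.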

Contact:
Dr. Zhiling Lan (lan AT iit DOT edu)

Acknowledgement:
This project is supported by the US National Science Foundation (CCF 2109316). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.