Intelligent Management of Hybrid Workloads for Extreme Scale
Computing
The high-performance computing (HPC) community is embracing
artificial intelligence (AI) techniques for countless pursuits, from
driving ground-breaking scientific discoveries to protecting our
national security. As newly emerging machine learning and
data-centric workloads proliferate in HPC, current
workload-management systems cannot keep up with the significant
challenges introduced by the diverse mix of applications co-running
on heterogeneous systems. This project tackles the problem by
developing an intelligent workload-management framework named MINT
(Multi-resource INtelligenT management), in which the distinctive
computational resource requirements of hybrid workloads will be
automatically identified and fulfilled to achieve extreme resource
efficiency and a satisfactory user experience. The project will
develop fundamental improvements in HPC workload management to
promote the use of large-scale supercomputers for emerging
data-centric applications (HPC4AI). Meanwhile, it will exploit
advanced AI technologies, especially multi-objective reinforcement
learning, to empower job scheduling and resource allocation in HPC (AI4HPC).
Key research thrusts include understanding performance implications
of diverse workloads on supercomputers via model-driven analysis,
new intelligent multi-resource scheduling methods, smart
resource-allocation strategies for minimal workload interference,
and extensive evaluation of the proposed framework through
trace-based simulation and testing.
Faculty:
Zhiling Lan (PI)
Kai Shu (co-PI)
Graduate Students:
Yuping Fan (PhD, graduated in 12/2021)
Boyang Li (PhD)
Matt Dearing (PhD)
Xiongxiao Xu (PhD)
Prashant Ravi (PhD)
Zhong Zheng (PhD)
Collaborators:
Mike Papka (ANL)
Bill Allcock (ANL)
Paul Rich (ANL)
Key Publications:
Y. Kang, X. Wang, and Z. Lan,
“Mitigating Network Contention with Intelligent Routing”, ACM/IEEE
SC, 2022.
B. Li, M. Dearing, Y. Fan, P. Rich,
B. Allcock, M. Papka, and Z. Lan, “MRSch: Multi-Resource
Scheduling for HPC”, IEEE Cluster, 2022.
B. Li, Y. Fan, M. Papka, and Z. Lan,
“Encoding for Reinforcement Learning Driven Scheduling”, JSSPP,
co-located with IPDPS, 2022.
Y. Fan, B. Li, D. Favorite, N. Singh,
T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan, “DRAS:
Deep Reinforcement Learning for Cluster Scheduling in High
Performance Computing”, IEEE TPDS (under revision),
2022.
Y. Fan, Z. Lan, T. Childers, P. Rich,
W. Allcock, and M. Papka, “Deep Reinforcement Agent for Scheduling
in HPC”, IPDPS, 2021. [PDF]
Y. Fan and Z. Lan, “DRAS-CQSim: A
Reinforcement Learning based Framework for HPC Cluster
Scheduling”, Software Impacts, 2021. [PDF]
Software Tools and Data:
(Software) DRAS/CQSim - a discrete-event-driven
scheduling simulator empowered by reinforcement learning.
It is available on the team's GitHub [Link].
(Software) CQGym - a common platform for
studying various cluster scheduling policies under the
same setting. In CQGym, a discrete-event-driven scheduling
environment is integrated with a scheduling agent, such as
a deep reinforcement learning agent, through the OpenAI Gym
interface. It is available on the team's GitHub [Link].
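To illustrate the general idea of coupling a discrete-event-driven scheduling environment with an agent through a Gym-style interface, here is a minimal sketch. It is not CQGym's actual code or API: the names `Job`, `ToySchedEnv`, and `run_first_fit` are hypothetical, the environment only mimics the classic Gym `reset()`/`step()` convention (it does not import the Gym package), and the reward (negative wait time of the started job) is an assumption chosen for simplicity.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    submit: float   # submission time
    nodes: int      # number of nodes requested
    runtime: float  # run time once started


class ToySchedEnv:
    """Hypothetical Gym-style wrapper around a tiny discrete-event simulator."""

    def __init__(self, jobs, total_nodes):
        if any(j.nodes > total_nodes for j in jobs):
            raise ValueError("a job requests more nodes than the system has")
        self.jobs = sorted(jobs, key=lambda j: j.submit)
        self.total_nodes = total_nodes

    def reset(self):
        self.now = 0.0
        self.free = self.total_nodes
        self.pending = list(self.jobs)  # jobs not yet submitted
        self.queue = []                 # submitted jobs waiting to start
        self.running = []               # min-heap of (end_time, job_id, nodes)
        self.total_wait = 0.0
        self._advance()
        return self._obs()

    def _obs(self):
        waiting = [(j.job_id, j.nodes, self.now - j.submit) for j in self.queue]
        return {"now": self.now, "free": self.free, "queue": waiting}

    def _admit(self):
        # Move every job submitted by the current time into the wait queue.
        while self.pending and self.pending[0].submit <= self.now:
            self.queue.append(self.pending.pop(0))

    def _advance(self):
        # Process submit/finish events until some queued job fits,
        # or until no events remain.
        self._admit()
        while (not any(j.nodes <= self.free for j in self.queue)
               and (self.pending or self.running)):
            next_submit = self.pending[0].submit if self.pending else float("inf")
            next_end = self.running[0][0] if self.running else float("inf")
            self.now = min(next_submit, next_end)
            while self.running and self.running[0][0] <= self.now:
                _, _, nodes = heapq.heappop(self.running)
                self.free += nodes
            self._admit()

    def step(self, action):
        """Start the queued job at index `action`; returns (obs, reward, done, info)."""
        job = self.queue[action]
        if job.nodes > self.free:
            # Invalid choice: penalize and leave the state unchanged.
            return self._obs(), -1.0, False, {"invalid": True}
        self.queue.pop(action)
        self.free -= job.nodes
        heapq.heappush(self.running, (self.now + job.runtime, job.job_id, job.nodes))
        wait = self.now - job.submit
        self.total_wait += wait
        self._advance()
        done = not self.queue and not self.pending  # every job has been started
        return self._obs(), -wait, done, {"invalid": False}


def run_first_fit(env):
    """Baseline agent: always start the first queued job that fits."""
    obs = env.reset()
    done, total = not obs["queue"], 0.0
    while not done:
        action = next(i for i, (_, nodes, _) in enumerate(obs["queue"])
                      if nodes <= obs["free"])
        obs, reward, done, _ = env.step(action)
        total += reward
    return total
```

In this sketch, a learning agent would replace `run_first_fit` and choose `action` from the observation. Because every policy runs against the same environment and job list, different policies can be compared under the same setting.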
Contact:
Dr. Zhiling Lan (lan AT iit DOT edu)
Acknowledgement:
This project is supported by the US National Science
Foundation (CCF 2109316). Note: Any opinions, findings, and
conclusions or recommendations expressed in this material
are those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.