IRON: Reducing Workload Interference on Massively Parallel
Platforms
Interconnect networks with Dragonfly and Fat-tree
configurations are dominant in high-performance computing
facilities and data centers. A key challenge of managing these
shared networks is workload interference. In a multi-user
computing environment, contention among applications for
shared network resources can trigger a vicious cycle in which
workload interference, low productivity, selfish user behavior,
and poor scheduling aggravate one another. This project aims
to address this fundamental problem on massively parallel
systems by developing the IRON (Interference
ReductiON) framework. The project
consists of three research thrusts: (1) flit-level network
modeling and simulation to gain insight into workload
interference and to explore what-if questions about it,
(2) interference-aware
scheduling to mitigate network congestion/contention among
applications, and (3) experiments to quantitatively characterize
workload interference and to assess the interference-aware
scheduling design.
The project has three key outcomes. The first is an
experimental study that quantitatively measures workload
interference of representative applications on production
systems. The second is the development of a high-fidelity modeling
tool to simulate and study complex network interference on
high-performance computing systems. The third is the design of
interference-aware methods, including routing and scheduling, to
reduce network contention among applications. The project
delivers several open-source software artifacts for interference
analysis and reduction on large-scale interconnect networks. The
integrated education and outreach plan has enhanced the Computer
Science curriculum and broadened participation by
underrepresented groups.
Faculty:
Zhiling Lan (PI)
Graduate Students:
Xin Wang (PhD, 2018 - 2022)
Yao Kang (PhD, 2019 - 2022)
Yuping Fan (PhD, 2018 - 2021)
Boyang Li (PhD, 2020 - 2021)
Matt Dearing (PhD, 2021 - 2022)
Prashant Ravi (PhD, 2022 - 2022)
Melanie Cornelius (PhD, 2021 - 2022)
Dustin Favorite (MS, 2021 - 2022)
Naunidh Singh (MS, 2021 - 2021)
Zhong Zheng (BS, 2021 - 2022)
Hunter Negron (BS, 2021 - 2022)
Hannah Greenblatt (BS, 2022)
Collaborators:
Xu Yang (Amazon)
Misbah Mubarak (Argonne, Amazon)
Rob Ross (Argonne)
Paul Rich (Argonne)
Bill Allcock (Argonne)
Mike Papka (Argonne)
Sudheer Chunduri (Argonne)
Kevin Harms (Argonne)
Key Publications:
Y. Fan, B. Li, D. Favorite, N. Singh,
T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan,
"DRAS: Deep Reinforcement Learning for Cluster Scheduling in High
Performance Computing", IEEE Transactions on Parallel and
Distributed Systems (TPDS), 2022.
Y. Kang, X. Wang, and Z. Lan, "Mitigating
Network Contention with Intelligent Routing", Proc. of
ACM/IEEE SC, 2022.
Y. Kang, X. Wang, and Z. Lan,
"Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on
Dragonfly Network", ACM HPDC, 2021. [PDF]
Y. Fan, Z. Lan, T. Childers, P. Rich,
W. Allcock, and M. Papka, "Deep Reinforcement Agent for Scheduling
in HPC", IPDPS, 2021. [PDF]
Y. Fan and Z. Lan, "DRAS-CQSim: A
Reinforcement Learning based Framework for HPC Cluster
Scheduling", Software Impacts, 2021. [PDF]
B. Li, Y. Fan, and Z. Lan, "Direct
Future Prediction Agent for Multi-Resource Scheduling in HPC",
Technical report, Illinois Tech, May 2021.
X. Wang, M. Mubarak, Y. Kang, R. Ross,
and Z. Lan, "Union: An Automatic Workload Manager for Accelerating
Network Simulation", Proc. of IPDPS, 2020. [PDF]
Y. Kang, "Study of I/O and
Communication Traffic Interference on Dragonfly System", Talk at
Argonne National Lab, August, 2019.
Y. Kang, X. Wang,
N. McGlohon, M. Mubarak, S. Chunduri, and Z. Lan, "Modeling
and Analysis of Application Interference on Dragonfly+", Proc. of ACM SIGSIM PADS'19,
2019. [PDF]
B. Li, S.
Chunduri, K. Harms, Y. Fan, and Z. Lan, "The Effect of System
Utilization on Application Performance Variability", Proc. of ROSS'19 (Runtime and
Operating Systems for Supercomputers), 2019. [PDF]
X. Wang,
M. Mubarak, X. Yang, R. Ross, and Z. Lan, "Trade-off Study
of Localizing Communication and Balancing Network Traffic
on a Dragonfly System",
Proc. of IPDPS'18, 2018. [PDF]
Software Tools and Data:
(Software) Union/CODES. It is an automatic
workload manager for workload interference analysis in
CODES. It is available on the team's GitHub. [Link]
(Software) CODES Dragonfly+ Module. It
is released to the public as the open-source dfp-fpar branch
in the CODES GitHub. [Link]
(Software) Q-adaptive/SST. It is a
reinforcement-learning-driven routing design for
preventing network contention on Dragonfly systems.
It is implemented in the SST toolkit and is available
on the team's GitHub. [Link]
(Software) DRAS/CQGym. It is a
common platform for evaluating reinforcement learning
scheduling agents versus existing heuristic and
optimization-based methods. It is available on the
team's GitHub. [Link]
(Data) Application communication traces collected on
the 11.69-petaflops Cray XC40 machine Theta at ALCF. [Link]
Contact:
Dr. Zhiling Lan (lan AT iit DOT edu)
Acknowledgement:
This project is supported by the US National Science
Foundation (CNS-1717763). Note: Any opinions, findings, and
conclusions or recommendations expressed in this material
are those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.