IRON: Reducing Workload Interference on Massively Parallel Platforms

Interconnect networks with Dragonfly and fat-tree topologies are dominant in high-performance computing facilities and data centers. A key challenge in managing these shared networks is workload interference. In a multi-user computing environment, contention among applications for shared network resources can set off a vicious cycle in which workload interference, low productivity, selfish user behavior, and poor scheduling aggravate one another. This project addresses this fundamental problem on massively parallel systems by developing the IRON (Interference ReductiON) framework. The project consists of three research thrusts: (1) flit-level network modeling and simulation to gain insight into workload interference and to explore what-if questions about it, (2) interference-aware scheduling to mitigate network congestion and contention among applications, and (3) experiments to quantitatively characterize workload interference and to assess the interference-aware scheduling design.

The project has three key outcomes. The first is an experimental study that quantitatively measures workload interference among representative applications on production systems. The second is a high-fidelity modeling tool for simulating and studying complex network interference on high-performance computing systems. The third is the design of interference-aware methods, including routing and scheduling, to reduce network contention among applications. The project delivers several open-source software artifacts for interference analysis and reduction on large-scale interconnect networks. The integrated education and outreach plan has enhanced the Computer Science curriculum and broadened participation by underrepresented groups.

Faculty:
  • Zhiling Lan (PI)

Graduate Students:
  • Xin Wang (PhD, 2018 - 2022)
  • Yao Kang (PhD, 2019 - 2022)
  • Yuping Fan (PhD, 2018 - 2021)
  • Boyang Li (PhD, 2020 - 2021)
  • Matt Dearing (PhD, 2021 - 2022)
  • Prashant Ravi (PhD, 2022)
  • Melanie Cornelius (PhD, 2021 - 2022)
  • Dustin Favorite (MS, 2021 - 2022)
  • Naunidh Singh (MS, 2021)
  • Zhong Zheng (BS, 2021 - 2022)
  • Hunter Negron (BS, 2021 - 2022)
  • Hannah Greenblatt (BS, 2022)

Collaborators:
  • Xu Yang (Amazon)
  • Misbah Mubarak (Argonne, Amazon)
  • Rob Ross (Argonne)
  • Paul Rich (Argonne)
  • Bill Allcock (Argonne)
  • Mike Papka (Argonne)
  • Sudheer Chunduri (Argonne)
  • Kevin Harms (Argonne)

Key Publications:
  • Y. Fan, B. Li, D. Favorite, N. Singh, T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan, "DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing", IEEE Transactions on Parallel and Distributed Systems (TPDS), 2022.
  • Y. Kang, X. Wang, and Z. Lan, "Mitigating Network Contention with Intelligent Routing", Proc. of ACM/IEEE SC, 2022.
  • Y. Kang, X. Wang, and Z. Lan, "Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network", ACM HPDC, 2021. [PDF]
  • Y. Fan, Z. Lan, T. Childers, P. Rich, W. Allcock, and M. Papka, "Deep Reinforcement Agent for Scheduling in HPC", IPDPS, 2021. [PDF]
  • Y. Fan and Z. Lan, "DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling", Software Impacts, 2021. [PDF]
  • B. Li, Y. Fan, and Z. Lan, "Direct Future Prediction Agent for Multi-Resource Scheduling in HPC", Technical report, Illinois Tech, May 2021.
  • X. Wang, M. Mubarak, Y. Kang, R. Ross, and Z. Lan, "Union: An Automatic Workload Manager for Accelerating Network Simulation", Proc. of IPDPS, 2020. [PDF]
  • Y. Kang, "Study of I/O and Communication Traffic Interference on Dragonfly System", Talk at Argonne National Lab, August 2019.
  • Y. Kang, X. Wang, N. McGlohon, M. Mubarak, S. Chunduri, and Z. Lan, "Modeling and Analysis of Application Interference on Dragonfly+", Proc. of ACM SIGSIM PADS'19, 2019. [PDF]
  • B. Li, S. Chunduri, K. Harms, Y. Fan, and Z. Lan, "The Effect of System Utilization on Application Performance Variability", Proc. of ROSS'19 (Runtime and Operating Systems for Supercomputers), 2019. [PDF]
  • X. Wang, M. Mubarak, X. Yang, R. Ross, and Z. Lan, "Trade-off Study of Localizing Communication and Balancing Network Traffic on a Dragonfly System", Proc. of IPDPS'18, 2018. [PDF]

Software Tools and Data:
  • (Software) Union/CODES. An automatic workload manager for workload interference analysis in CODES. It is available on the team's GitHub. [Link]
  • (Software) CODES Dragonfly+ Module. Released to the public as the open-source dfp-fpar branch in the CODES GitHub. [Link]
  • (Software) Q-adaptive/SST. A reinforcement-learning-driven routing design for preventing network contention on Dragonfly systems. It is implemented in the SST toolkit and is available on the team's GitHub. [Link]
  • (Software) DRAS/CQGym. A common platform for evaluating reinforcement learning scheduling agents against existing heuristic and optimization-based methods. It is available on the team's GitHub. [Link]
  • (Data) Application communication traces collected on Theta, the 11.69-petaflop Cray XC40 machine at ALCF. [Link]

Contact:
    Dr. Zhiling Lan (lan AT iit DOT edu)

Acknowledgement:
    This project is supported by the US National Science Foundation (CNS-1717763). Note: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.