AI-enabled science, where advanced machine-learning technologies are
used for surrogate models, auto tuning, and in situ data analysis,
is quickly being adopted in science and engineering for tackling
complex and challenging computational problems. The wide adoption of
heterogeneous systems embedded with different types of processing
devices (CPUs, GPUs, and AI accelerators) further complicates the
execution of AI-enabled science on supercomputers. Research on
AI-enabled simulations on heterogeneous systems is still far from
sufficient.
The long-term research vision is to develop SEEr, a
Scalable, Energy-Efficient HPC environment for scaling up and
accelerating AI-enabled science for scientific discovery. This
planning project explores fundamental questions to realize the
research vision. The team focuses on scalable surrogate models for
an incompressible computational fluid dynamics application using
OpenFOAM, cost models for this application on heterogeneous
resources, dynamic task mapping for efficient execution, and
performance and power monitoring and characterization to explore
tradeoffs among performance, scalability, and energy efficiency on a
state-of-the-art heterogeneous testbed at ALCF. The unified team of
researchers tackles the problem in a cross-layer manner, focusing on
the synergies among application algorithms, programming languages
and compilers, runtime systems, and high-performance computing.
Faculty:
Zhiling Lan, Stefan Muller, Romit Maulik (Illinois Tech)
Mantis, a unified performance and power profiling interface for
applications running on heterogeneous systems [https://github.com/SPEAR-IIT/mantis]
Two AI-enabled applications (mini-app and PythonFOAM)
explored in the SEEr planning project. The code and run scripts
for the heterogeneous CPU-GPU systems at Argonne Leadership
Computing Facility (ALCF) are available at the team's GitHub
repository [https://github.com/SPEAR-IIT/SEEr]
Technical Reports at Illinois Tech:
H. Greenblatt, H. Negron, M. Cornelius, S. Muller, R. Maulik,
X. Wu, M. Papka, and V. Taylor, “Performance Characterization of
AI-enabled Scientific Applications”, Technical report,
August 2022. [pdf]
H. Greenblatt, “CS597 Report:
Study of PythonFoam on ThetaGPU”,
Technical report, Dec 2022. [pdf]
H. Negron and P. Naik, “CS597
Report: Mini-App Analysis on Polaris and
ThetaGPU”, Technical report, Dec 2022. [pdf]
M. Cornelius, H. Greenblatt, and Z. Lan, “Mantis: A Unified
Performance and Power Profiling Interface on Heterogeneous
Systems”, Technical report, August 2022. [pdf]
H. Greenblatt, “CS597 Report: PythonFoam Benchmarking”,
Technical report, May 2022. [pdf]
H. Negron and Z. Zheng, “CS597 Report: Mini-app
Benchmarking”, Technical report, May 2022. [pdf]
Publications:
X. Wu, V. Taylor, and Z. Lan, Performance and Energy
Improvement of the ECP Proxy App SW4lite under Various
Workloads, SC2021 Workshop on Memory-Centric High
Performance Computing (MCHPC’21), Nov. 2021. [pdf]
X. Wu, V. Taylor, and Z. Lan, Performance and Power
Modeling and Prediction Using MuMMI and Ten Machine Learning
Methods, Concurrency and Computation Practice and Experience,
August 2022, https://doi.org/10.1002/cpe.7254. [pdf]
Yao Kang, Xin Wang, and Zhiling Lan. Study of Workload
Interference with Intelligent Routing on Dragonfly. In
Proceedings of SC ’22, 2022. [pdf]
Boyang Li, Yuping Fan, Matthew Dearing, Zhiling Lan, Paul
Rich, William Allcock, and Michael Papka. MRSch: Multi-resource
Scheduling for HPC. In 2022 IEEE International Conference on
Cluster Computing (CLUSTER), pages 47–57, 2022. [pdf]
Yuping Fan, Zhiling Lan, Paul Rich, William Allcock, and
Michael E. Papka. Hybrid Workload Scheduling on HPC Systems. In
2022 IEEE International Parallel and Distributed Processing
Symposium (IPDPS), pages 470–480, 2022. [pdf]
Yuping Fan, Boyang Li, Dustin Favorite, Naunidh Singh, Taylor
Childers, Paul Rich, William Allcock, Michael E. Papka, and
Zhiling Lan. DRAS: Deep Reinforcement Learning for Cluster
Scheduling in High Performance Computing. IEEE Transactions on
Parallel and Distributed Systems, 33(12):4903–4917, 2022. [link]
Contact:
Dr. Zhiling Lan (lan AT iit DOT edu)
Acknowledgement:
This project is supported by the US National Science
Foundation (CCF 2119294, 2119203, 2119056). Note: Any
opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s) and do
not necessarily reflect the views of the National Science
Foundation.