Collaborative Research: Experimental-based Research on
Effective Models of Parallel Application Execution Time, Power,
and Resilience
The increasing scale and complexity of parallel
systems present enormous challenges to parallel applications.
One such challenge is the integration and balancing of execution
time, power, and resilience for parallel applications. The
MuMMI_R project seeks to advance the scientific understanding of
the interdependence among power, execution time, and resilience
for various application-system configurations. The broader
impacts include training undergraduate and graduate students
and engaging with programs such as REU, CREU, and DREU
to increase the participation of students from underrepresented
groups in the project.
This project aims to develop effective techniques
for quantifying the complicated tradeoffs among execution time,
power, and resilience, and to provide a tuning mechanism for
user-defined metrics. The project consists of three research
thrusts: (1) experimental study of different application-system
configurations, (2) development of models for quantifying the
interplay among runtime, power, and resilience, and (3)
model-based analysis. The resulting framework, MuMMI_R, can
provide valuable insights into application-system interactions
and aid in the design of efficient parallel applications (with
respect to execution time, power requirements, and resilience),
runtime systems, and computer architectures. The key
outcomes include technical papers, a user-level dynamic power
capping library, and a large body of experimental data for
the community.
This is a collaborative project between two
universities: the University of Chicago and Illinois Institute of
Technology.
Team Members
- Valerie Taylor (UChicago, PI)
- Xingfu Wu (UChicago, co-PI)
- Zhiling Lan (Illinois Tech, co-PI)
- Sahil Sharma (BS, 1/2020-5/2021)
- Avery Peck (BS, 1/2020-12/2020)
- Boyang Li (PhD, 9/2019-12/2020)
- Xin Wang (PhD, 6/2020-8/2020)
- Melanie Cornelius (PhD, 8/2020-3/2021)
- Peixin Qiao (PhD, 2017-12/2019)
- Manqi Zhang (PhD, 2016-2017)
Key Publications
- X. Wu, V. Taylor, J. Cook, and P. Mucci, "Using
Performance-Power Modeling to Improve Energy Efficiency of HPC
Applications", IEEE Computer, Vol. 49, No. 10, pp.
20-29, Oct. 2016.
- X. Wu, V. Taylor, and Z. Lan, "MuMMI_R: Analyzing and Modeling
Power and Time under Different Resilience Strategies", SC2016
Poster, 2016.
- X. Wu, V. Taylor, and Z. Lan, "Evaluating Runtime and Power
Requirements of Multilevel Checkpointing MPI Applications on
Four Parallel Architectures: An Empirical Case Study", Cray
User Group Conference, 2018.
- X. Wu, V. Taylor, J. Wozniak, R. Stevens, T. Brettin, and F.
Xia, "Performance, Power, and Scalability Analysis of the
Horovod Implementation of the CANDLE NT3 Benchmark on the Cray
XC40 Theta", SC18 Workshop on Python for High-Performance
and Scientific Computing, 2018.
- X. Wu, V. Taylor, J. Wozniak, R. Stevens, T. Brettin, and F.
Xia, "Performance, Energy, and Scalability Analysis and
Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks",
Proc. of ICPP'19, 2019.
- P. Qiao, Z. Lan, X. Wu, and V. Taylor, "Application Power
Pattern Characterization: Implications of Power Capping",
Technical Presentation at Argonne MCS, 2019.
- X. Wu, V. Taylor, and Z. Lan, "Performance and Power Modeling and
Prediction Using MuMMI and Ten Machine Learning Methods", Cray
User Group Conference, 2020.
- Xingfu Wu, Aniruddha Marathe, Siddhartha Jana, Ondrej
Vysocky, Jophin John, Andrea Bartolini, Lubomir Riha,
Michael Gerndt, Valerie Taylor, and Sridutt Bhalachandra,
"Toward an End-to-End Auto-tuning Framework in HPC PowerStack",
Energy Efficient HPC State of Practice 2020 (EE HPC SOP 20),
Sep. 14-17, 2020, Kobe, Japan.
- Xingfu Wu and Valerie Taylor, "Utilizing Ensemble Learning for
Performance and Power Modeling and Improvement of Parallel
Cancer Deep Learning CANDLE Benchmarks", Concurrency and
Computation: Practice and Experience, 2021, e6516,
https://doi.org/10.1002/cpe.6516.
- S. Sharma, Z. Lan, X. Wu, and V. Taylor, "A Dynamic Power
Capping Library for HPC Applications", IEEE Cluster (2-page
poster), 2021.
- S. Sharma, Z. Lan, X. Wu, and V. Taylor, "DNPC: a Dynamic
Node-level Power Capping Library for Scientific Applications", Undergraduate
Research Journal at Illinois Tech, Spring 2021.
- M. Cornelius, A. Peck, Z. Lan, W. Allcock, and B. Toonen, "A
Study of NPB and CANDLE on Commercial Off-the-Shelf
Disaggregated Memory", The 2nd Workshop on Resource
Disaggregation and Serverless (WORDS '21), co-located with
ASPLOS'21.
Software & Data
- The MuMMI_R database [Link]
- The open-source dynamic power capping library called DNPC [Link]
Contact
Valerie Taylor (vtaylor AT anl DOT gov)
Zhiling Lan (lan AT iit DOT edu)
Acknowledgement
This project is supported by the US National Science
Foundation (CCF-1618776 and CCF-1801856). Note: Any
opinions, findings, and conclusions or recommendations
expressed in this material are those of the author(s)
and do not necessarily reflect the views of the National
Science Foundation.