Software - ZHILING LAN

CQSim: trace-based, event-driven scheduling simulator. [github link]
Note: if you use CQSim in your work, please cite the paper: X. Yang, Z. Zhou, S. Wallace, Z. Lan, W. Tang, S. Coghlan, and M. Papka, "Integrating Dynamic Pricing of Electricity into Energy Aware Scheduling for HPC Systems", Proc. of SC'13, 2013. [PDF]
The repo contains a branch called DRAS (Deep Reinforcement Learning Agent for HPC scheduling). If you use CQSim/DRAS in your work, please cite the paper: Y. Fan, T. Childers, P. Rich, W. Allcock, M. Papka, and Z. Lan, "Deep Reinforcement Agent for Scheduling in HPC", Proc. of IPDPS'21, 2021. [PDF]
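
To illustrate what "trace-based, event-driven" means for a scheduling simulator, here is a minimal sketch in Python. It is not CQSim's code or API: the `Job` fields, the `simulate_fcfs` function, and the plain first-come-first-served policy are all assumptions made for this example.

```python
import heapq
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    submit: float   # submission time taken from the trace
    runtime: float  # actual runtime taken from the trace
    nodes: int      # number of nodes requested


def simulate_fcfs(trace, total_nodes):
    """Replay a job trace under FCFS and return {job_id: (start, end)}."""
    events = []  # min-heap of (time, priority, seq, kind, job)
    for seq, job in enumerate(sorted(trace, key=lambda j: j.submit)):
        if job.nodes > total_nodes:
            raise ValueError(f"job {job.job_id} can never fit on the machine")
        # Priority 1: at equal timestamps, finishes (priority 0) go first.
        heapq.heappush(events, (job.submit, 1, seq, "submit", job))

    seq = len(events)        # unique tie-breaker for later events
    free = total_nodes
    waiting = deque()        # FCFS wait queue
    schedule = {}

    while events:
        now, _, _, kind, job = heapq.heappop(events)
        if kind == "submit":
            waiting.append(job)
        else:  # "finish": the job releases its nodes
            free += job.nodes

        # Start jobs strictly in arrival order while the head job fits.
        while waiting and waiting[0].nodes <= free:
            nxt = waiting.popleft()
            free -= nxt.nodes
            end = now + nxt.runtime
            schedule[nxt.job_id] = (now, end)
            heapq.heappush(events, (end, 0, seq, "finish", nxt))
            seq += 1
    return schedule
```

The simulated clock jumps from one event to the next instead of advancing in fixed steps, which is why replaying long traces stays cheap. A real simulator would add backfilling, richer trace fields, and pluggable policies.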

Union: workload manager for integrating coNCePTual (a network correctness and performance testing language) as an online workload for the network simulator CODES. [github link]
Note: if you use Union in your work, please cite the paper: X. Wang, M. Mubarak, Y. Kang, R. Ross, and Z. Lan, "Union: An Automatic Workload Manager for Accelerating Network Simulations", Proc. of IPDPS, 2020. [PDF]

Q-adaptive: multi-agent reinforcement learning based routing for Dragonfly networks [github link]
Note: if you use Q-adaptive/SST in your work, please cite the paper: Yao Kang, Xin Wang, and Zhiling Lan. "Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network". [PDF]
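
As general background on reinforcement learning based routing, the sketch below shows the classic Q-routing update, in which each router learns per-neighbor delivery-time estimates from feedback. It is not Q-adaptive's algorithm or code; the `QRouter` class, its method names, and the learning rate are assumptions made for this example.

```python
from collections import defaultdict


class QRouter:
    """One routing agent holding delivery-time estimates per (dest, neighbor)."""

    def __init__(self, name, neighbors, alpha=0.5):
        self.name = name
        self.neighbors = list(neighbors)
        self.alpha = alpha            # learning rate (assumed value)
        self.q = defaultdict(float)   # (dest, neighbor) -> estimated time

    def best_estimate(self, dest):
        """Estimate reported back to the upstream router."""
        if dest == self.name:
            return 0.0
        return min(self.q[(dest, n)] for n in self.neighbors)

    def choose_next_hop(self, dest):
        """Greedy choice: the neighbor with the lowest estimate."""
        return min(self.neighbors, key=lambda n: self.q[(dest, n)])

    def update(self, dest, neighbor, hop_delay, neighbor_estimate):
        """Move the estimate toward (local delay + neighbor's own estimate)."""
        key = (dest, neighbor)
        target = hop_delay + neighbor_estimate
        self.q[key] += self.alpha * (target - self.q[key])
        return self.q[key]
```

Here `hop_delay` stands for the queueing plus transmission time observed on the hop, and `neighbor_estimate` is the value the chosen neighbor returns from its own `best_estimate`. Each router acts as an independent agent, which is what makes the scheme multi-agent.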

MonEQ: application-level power profiling library for applications running on IBM Blue Gene/Q. [MonEQ BG/Q v1.1 release]
Note: if you use MonEQ in your work, please cite the paper: S. Wallace, V. Vishwanath, S. Coghlan, Z. Lan, and M. Papka, "Profiling Benchmarks on IBM Blue Gene/Q", Proc. of IEEE Cluster'13, 2013. [MonEQ paper]

CODES: flit-level, event-driven simulation toolkit for distributed system simulations. My team collaborates with ANL and RPI on the network simulation component of CODES. [link]
Note: the team has two interesting papers on HPC interconnect network simulation:
X. Wang, M. Mubarak, X. Yang, R. Ross, and Z. Lan, "Trade-off Study of Localizing Communication and Balancing Network Traffic on Dragonfly System", Proc. of IPDPS'18, 2018. [PDF]
X. Yang, J. Jenkins, M. Mubarak, R. Ross, and Z. Lan, "Watch Out for the Bully! Job Interference Study on Dragonfly Network", Proc. of SC'16, 2016. [PDF]

DNPC: dynamic power capping library for HPC applications [github link]
Note: if you use DNPC in your work, please cite the paper: Sahil Sharma, Zhiling Lan, Xingfu Wu, and Valerie Taylor, “A Dynamic Power Capping Library for HPC Applications”, IEEE Cluster 2021 (2-page research poster). [PDF]

TOPPER: a system-level tradeoff modeling tool for quantitative analysis of performance, power, and resilience on extreme scale systems. It is built on top of CPN Tools, a tool for editing, simulating, and analyzing colored Petri nets. [link]
Note: if you use TOPPER in your work, please cite the paper: L. Yu, Z. Zhou, Y. Fan, M. E. Papka, and Z. Lan, "System-Wide Tradeoff Modeling of Performance, Power, and Resilience on Petascale Systems", Journal of Supercomputing, 2018. [PDF]

PuPPET: a power-performance modeling tool for predictive analysis of power management on extreme scale systems. It is built on top of CPN Tools. [link]
Note: if you use PuPPET in your work, please cite the paper: L. Yu, Z. Zhou, S. Wallace, M. E. Papka, and Z. Lan, "Quantitative Modeling of Power Performance Tradeoffs on Extreme Scale Systems", Journal of Parallel and Distributed Computing, 2015. [PDF]

TopoMap: a suite of user-level libraries for effective topology-aware task mapping of MPI applications. Currently, it supports InfiniBand-connected supercomputers, Cray XT5, and IBM Blue Gene/P systems. [github link]
Note: if you use TopoMap in your work, please cite the papers: (1) J. Wu, X. Xiong, E. Berrocal, J. Wang, and Z. Lan, "Topology Mapping of Irregular Parallel Applications on Torus-Connected Supercomputers", The Journal of Supercomputing, 2016. (2) J. Wu, X. Xiong, and Z. Lan, "Hierarchical Task Mapping for Parallel Applications on Supercomputers", The Journal of Supercomputing, 71(5):1776-1802, 2015.

LibProfil: a lightweight MPI profiling and tracing library intended to discover the communication topology of MPI applications. It can be used with TopoMap. [github link]
Note: if you use LibProfil in your work, please cite the paper: J. Wu, X. Xiong, E. Berrocal, J. Wang, and Z. Lan, "Topology Mapping of Irregular Parallel Applications on Torus-Connected Supercomputers", The Journal of Supercomputing, 2016.

Anomaly Detection Tool SND: a non-parametric anomaly detection tool. [link]
Note: if you use the tool in your work, please cite the paper: L. Yu and Z. Lan, "A Scalable, Non-Parametric Anomaly Detection Method for Large Scale Computing", IEEE Transactions on Parallel and Distributed Systems, 2016. [PDF]

QSim: an event-driven job scheduling simulator for Cobalt. [github link]
Note: if you use the tool in your work, please cite the paper: W. Tang, N. Desai, D. Buettner, and Z. Lan, "Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on Blue Gene/P", Proc. of IPDPS'10 [Best Paper Award], 2010. [PDF]

Public Data at ALCF [ALCF link]
Note: if you use the data in your work, please acknowledge the Argonne Leadership Computing Facility and cite the following paper: W. Allcock, P. Rich, Y. Fan, and Z. Lan, "Experience and Practice of Batch Scheduling on Leadership Supercomputers at Argonne", Proc. of the 21st Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP), held in conjunction with IPDPS, 2017. [PDF]

Workload Trace from Intrepid at ALCF [link]
Note: if you use the data in your work, please cite the following paper: W. Tang, Z. Lan, N. Desai, D. Buettner, and Y. Yu, "Reducing Fragmentation on Torus-Connected Supercomputers", Proc. IEEE Intl. Parallel & Distributed Processing Symp., pp. 828-839, May 2011. [PDF]

RAS Log from Intrepid at ALCF [link]
Note: if you use the data in your work, please cite the following paper: Z. Zheng, L. Yu, W. Tang, Z. Lan, R. Gupta, N. Desai, S. Coghlan, and D. Buettner, "Co-Analysis of RAS Log and Job Log on Blue Gene/P", in Proc. of IEEE International Parallel & Distributed Processing Symposium (IPDPS'11), Anchorage, AK, USA, 2011. [PDF]

SysDP: an automated fault diagnosis and prognosis software toolkit for large-scale systems. It has been tested with RAS (Reliability, Availability, and Serviceability) logs from Blue Gene systems.

FT-Pro: an application-level adaptive fault tolerance system for parallel applications. Here, "application-level" means the focus is on reducing application completion time in the presence of failure. It allows applications to avoid anticipated failures via preventive migration, and in the case of unforeseeable failures, to minimize their impact through selective checkpointing. It is implemented with the MPICH-V checkpointing package.
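
The choice between preventive migration and selective checkpointing can be pictured as picking the cheapest expected outcome for the next interval. The sketch below is only an illustration of that idea: the cost model, the `choose_action` function, and all parameter names are assumptions made for this example, not FT-Pro's actual decision rule.

```python
def choose_action(p_fail, interval, work_at_risk,
                  checkpoint_cost, migration_cost, restart_cost):
    """Return the action with the lowest expected time cost.

    p_fail          -- predicted failure probability for the next interval
    interval        -- length of the next interval
    work_at_risk    -- computation done since the last checkpoint
    All costs are in the same time unit. Simplifying assumption: a
    migration always moves the process away from the predicted failure.
    """
    expected_cost = {
        # Do nothing: a failure loses old and new work, plus a restart.
        "skip": p_fail * (restart_cost + work_at_risk + interval),
        # Checkpoint now: a failure loses only the coming interval.
        "checkpoint": checkpoint_cost + p_fail * (restart_cost + interval),
        # Migrate: pay the migration overhead, lose nothing.
        "migrate": migration_cost,
    }
    return min(expected_cost, key=expected_cost.get)
```

With a confident failure prediction the migration overhead is worth paying, with no predicted failure the cheapest choice is to keep computing, and in between a checkpoint limits the loss.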

FARS: a Fault-Aware Runtime System for system-level adaptive fault tolerance. Here, "system-level" means the primary goal is to improve system productivity in the presence of failure. It not only includes runtime strategies to allocate spare nodes for failure avoidance, but also provides a general mechanism to select running jobs for rescheduling in case of resource contention. An event-driven simulator was developed to emulate computing systems that use a batch scheduler enhanced with FARS. It has been tested with both synthetic data and machine traces collected from production systems.

FREM: a Fast REstart Mechanism to improve process recovery for general checkpoint/restart protocols. The core idea is to enable early process restart from a partial checkpoint image by tracking data access patterns after each checkpoint. A prototype that implements FREM with the BLCR checkpointing tool has been developed and tested with SPEC 2006. [FREM paper in IEEE Trans. on Computers]
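
The early-restart idea can be sketched as splitting a checkpoint image into the part a process is likely to need right away and the part that can arrive later. The function below is a simplified illustration: `plan_restart`, the page-number representation, and the two-way split are assumptions made for this example and do not reflect how FREM or BLCR are implemented.

```python
def plan_restart(image_pages, touched_after_checkpoint):
    """Split a checkpoint image into eager and lazy parts.

    image_pages              -- page ids in the image, in stored order
    touched_after_checkpoint -- page ids accessed shortly after the
                                checkpoint was taken (the tracked pattern)
    Returns (eager, lazy): pages to load before the process resumes,
    and pages that can be loaded after it is already running.
    """
    touched = set(touched_after_checkpoint)
    eager = [p for p in image_pages if p in touched]
    lazy = [p for p in image_pages if p not in touched]
    return eager, lazy
```

For example, `plan_restart([0, 1, 2, 3, 4], [3, 1, 3])` yields `([1, 3], [0, 2, 4])`: the process can resume once pages 1 and 3 are in memory, while the remaining pages are still being retrieved.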

ParaDLB and DistDLB: dynamic load balancing methods for large-scale applications using the structured adaptive mesh refinement (SAMR) algorithm. The methods have been implemented and tested in the cosmological simulation code ENZO.

SWS (Seismic Wave Simulation): a seismic wave simulation package based on the finite element method. It supports theoretical studies of seismic wave propagation as well as engineering work in seismic data acquisition, processing, interpretation, and inversion. The tool can numerically solve any combination of the acoustic wave equation, the isotropic and anisotropic elastic wave equations, and the two-phase media wave equation. It was developed by Zhiling Lan and Xiumin Shao at the Chinese Academy of Sciences during 1993-1997. The software was purchased and used by China National Petroleum Corporation.