COTA: a COoperative framework for Topology Awareness
As the number of compute nodes increases, so does the size of the
interconnect network. Historically, floating point was the most costly
component of a system, but this is no longer the case. Systems today, and
those anticipated in the future, are increasingly bound by their
communication infrastructure and the power dissipation associated with data
movement across the rapidly growing number of nodes. Addressing the
increasing cost of data movement on ever-growing systems is therefore critical.
This project develops a framework named COTA, a COoperative framework
for Topology Awareness. COTA is an integrated framework that
coordinates across the hardware, job scheduler, runtime, and application to
jointly attack the increasing concern of data movement for communication-
and power-efficiency on large-scale systems. Most importantly, the
framework supports topology awareness not only at job startup, but also
during job execution. The newly developed mapping algorithms,
topology-aware methods and tools, and topology-aware models provide a
critical foundation for the realization of topology awareness on current
and future systems. This research has a direct impact on system
productivity as well as a broad range of application domains that use
parallel systems for simulations. The project also enhances the
curriculum at Illinois Tech, broadens participation
by underrepresented groups, and reaches out to the surrounding communities.
One-page poster summaries of COTA are available:
poster'16 and
poster'17.
Faculty:
Zhiling Lan (PI, CS faculty)
Jia Wang (co-PI, ECE faculty)
Graduate Students:
Xu Yang (CS Ph.D. student, now at Amazon) (2013-2017)
Xingwu Zheng (ECE Ph.D. student) (2013-2017)
Xin Wang (CS Ph.D. student) (2014-2017)
Manqi Zhang (CS Ph.D. student) (9/2016-12/2016)
Peixin Qiao (CS Ph.D. student) (8/2016-9/2017)
Zhou Zhou (CS Ph.D. student, now at Salesforce) (2012-2016)
Yuping Fan (ECE Master, now CS Ph.D. student) (2014-2016)
Ying Chen (CS Ph.D. student) (1/2016-7/2016)
Eduardo Berrocal (CS Ph.D. student) (2014)
Jianchao Yang (CS Master student, 10/2013-05/2014)
Qi Zhan (CS Master student, 10/2013-05/2014)
REU Students:
Arushi Rai (CS Undergraduate, 7/2017-9/2017)
(REU Report)
Blake Ehrenbek (CS Undergraduate, 7/2017-9/2017)
(REU Report)
Aleksandra Kukielko (CS undergraduate, 2016)
(REU Report)
Jia Hao He (CS undergraduate, 09/2015-12/2015)
Tarun N. Gidwani (CS undergraduate, 2014)
Asad Patel (CS undergraduate, 2014)
Collaborators:
Jingjin Wu at Univ. of Elect. Science & Tech., China
Xuangxing Xiong at Synopsys Inc.
Paul Rich, Vitali Morozov, John Jenkins, Misbah Mubarak, and Rob Ross at Argonne National Laboratory
Wei Tang and Narayan Desai at Google Inc.
Key Publications: [Link]
X. Yang, J. Jenkins, M.
Mubarak, R. Ross, and Z. Lan, "Watch Out for the Bully! Job Interference
Study on Dragonfly Network", Proc. of SC'16 (acceptance rate is 18%),
2016. [PDF]
X. Zheng, Z. Zhou, X.
Yang, Z. Lan, and J. Wang, "Exploring Plan-Based Scheduling for Large-Scale
Computing Systems", Proc. of IEEE Cluster'16
(acceptance rate is 24%), 2016. [PDF]
Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang,
V. Morozov, and N. Desai, "Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus
Network Allocation Constraints", IEEE Transactions on Parallel
and Distributed Systems, 2016.
Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang,
J. Wang, and Z. Lan, "I/O Aware Job Scheduling and Bandwidth Allocation for Petascale Computing
Systems", Journal of Parallel Computing (ParCo),
2016. [PDF]
J. Wu, X. Xiong, E. Berrocal, J. Wang, and Z. Lan, "Topology Mapping of
Irregular Parallel Applications on Torus-Connected Supercomputers",
The Journal of Supercomputing,
2016.
J. Wu, X. Xiong, and Z. Lan, "Hierarchical Task Mapping for Parallel
Applications on Supercomputers", The Journal of Supercomputing, 71(5):1776-1802, 2015.
X. Yang, J. Jenkins, M. Mubarak, X. Wang, R. Ross, and Z. Lan,
"Study of Intra- and Inter-Job Interference on Torus Networks",
Proc. of ICPADS (The 22nd IEEE Intl Conf. on Parallel and Distributed Systems),
2016.
Z. Zhou, X. Yang, D. Zhao, P. Rich, W. Tang, J.
Wang, and Z. Lan, "I/O-aware Batch Scheduling for Petascale Computing Systems",
Proc. of Cluster'15,
2015.
Z. Zhou, X. Yang, Z. Lan, P. Rich, W. Tang, V. Morozov, and N. Desai, "Improving
Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints",
Proc. of IPDPS'15,
2015.
X. Yang, X. Zheng, Z. Zhou, W. Tang, J. Wang, and Z. Lan,
"Balancing Job Performance with System Performance via Locality-Aware Scheduling on Torus-Connected Systems",
Proc. of IEEE Cluster'14,
2014. [PDF]
X. Yang, Z. Zhou, S. Wallace, Z.
Lan, W. Tang, S. Coghlan, and M. Papka, "Integrating Dynamic Pricing of Electricity
into Energy Aware Scheduling for HPC Systems", Proc. of SC'13,
2013. [PDF]
Ph.D. Dissertations:
Xingwu Zheng, "Advanced Algorithms For HPC Job Scheduling" [Advisor:
Jia Wang, co-advisor: Zhiling Lan], Department of Electrical and Computer Engineering, Illinois Institute of Technology, November 2017.
Xu Yang, "Cooperative Batch Scheduling for HPC Systems" [Advisor:
Zhiling Lan], Department of Computer Science, Illinois Institute of Technology, April 2017.
Zhou Zhou, "Multi-Dimensional Batch Scheduling Framework for
High-End Supercomputers" [Advisor:
Zhiling Lan], Department of Computer Science, Illinois Institute of Technology, December 2015.
Software Tools:
(Software) CQSim - a discrete-event-driven scheduling simulator.
[Link]
(Software) LibProfil - a lightweight, user-transparent communication profiler.
[Link]
(Software) TOPOMap - two topology-aware task mapping libraries. [Link]
Contact:
Dr. Zhiling Lan (lan AT iit DOT edu)
Acknowledgement:
This project is supported by the US National Science Foundation
(CNS-1320125).
Note: Any opinions, findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily reflect the views of the
National Science Foundation.