Talks

On Mitigating Congestion in High Performance Networks

Abhinav Bhatele - Center for Applied Scientific Computing, Lawrence Livermore National Laboratory

Thursday, February 7, 2019, 11:00 am-12:00 pm

Abstract

High performance networks enable fast communication between compute nodes on large clusters and supercomputers. Even so, many parallel programs spend a significant fraction of their execution time performing communication (process-to-process messages, filesystem reads/writes, etc.) on these networks. The cause is the sharing of network resources among different traffic classes and among concurrently running programs (jobs). This sharing leads to network congestion and, as a result, to run-to-run performance variability and performance degradation of individual jobs. No satisfactory solutions yet exist for mitigating such performance degradation on systems that allow jobs to share the network.

In this talk, I will present two novel algorithms to mitigate congestion on high performance networks by minimizing sharing of network links among jobs. The first algorithm is a new resource allocation policy used by the job scheduler on fat-tree network based systems to assign "isolated" node partitions to individual jobs. These isolated partitions prevent multiple jobs from sharing the same network links, and as a result, completely eliminate inter-job network interference. The second algorithm is a new adaptive routing algorithm that considers link congestion arising from overlapping network flows of multiple jobs. Our new adaptive flow-aware routing (AFAR) algorithm implements a greedy heuristic to migrate some flows from heavily congested network links to those with low network traffic. I will also present a brief overview of my research and plans for future research.
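To make the idea of greedy flow migration concrete, here is a minimal Python sketch. It is not the AFAR algorithm from the talk, whose details the abstract does not give. It assumes each flow has a fixed demand and a short list of candidate paths, and it repeatedly moves one flow off the most loaded link whenever that lowers the peak link load. The names link_loads and greedy_reroute, the flow table, and the link labels are all hypothetical.

```python
from collections import defaultdict


def link_loads(flows, choice):
    """Sum the demand of every flow over each link on its chosen path."""
    loads = defaultdict(float)
    for name, (demand, paths) in flows.items():
        for link in paths[choice[name]]:
            loads[link] += demand
    return loads


def greedy_reroute(flows, choice, max_moves=100):
    """Hypothetical greedy heuristic: move one flow at a time off the most
    loaded link, but only if the move lowers the peak link load."""
    choice = dict(choice)
    for _ in range(max_moves):
        loads = link_loads(flows, choice)
        if not loads:
            break
        hot = max(loads, key=loads.get)  # most congested link
        peak = loads[hot]
        best = None  # (new_peak, flow_name, path_index)
        for name, (demand, paths) in flows.items():
            if hot not in paths[choice[name]]:
                continue  # this flow does not cross the congested link
            for i in range(len(paths)):
                if i == choice[name]:
                    continue
                trial = dict(choice)
                trial[name] = i
                new_peak = max(link_loads(flows, trial).values())
                if new_peak < peak and (best is None or new_peak < best[0]):
                    best = (new_peak, name, i)
        if best is None:
            break  # no single move improves the peak load
        choice[best[1]] = best[2]
    return choice


# Assumed toy input: flow name -> (demand, list of candidate paths),
# where each path is a list of link labels.
flows = {
    "a": (10, [["L1"], ["L2"]]),
    "b": (10, [["L1"], ["L2"]]),
    "c": (5, [["L1"]]),
}
initial = {"a": 0, "b": 0, "c": 0}  # every flow starts on link L1
result = greedy_reroute(flows, initial)
```

In this toy input all three flows start on link L1. The sketch moves flow "a" to L2 and then stops, because moving "b" as well would make L2 the new bottleneck. A real adaptive router would work from measured link traffic rather than a static demand table.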

Bio

Abhinav Bhatele is a computer scientist in the Center for Applied Scientific Computing at Lawrence Livermore National Laboratory. His research interests are broadly in systems and networks, with a focus on parallel computing and big data analytics. He has published research on programming models and runtimes, network design and simulation, applications of machine learning to parallel systems, and the analysis, modeling, and optimization of the performance of parallel software and systems.

Abhinav received a B.Tech. degree in Computer Science and Engineering from I.I.T. Kanpur, India in May 2005, and M.S. and Ph.D. degrees in Computer Science from the University of Illinois at Urbana-Champaign in 2007 and 2010, respectively. Abhinav was an ACM-IEEE CS George Michael Memorial HPC Fellow in 2009. He has received best paper awards at Euro-Par 2009, IPDPS 2013, and IPDPS 2016. Abhinav was selected as one of the recipients of the IEEE TCSC Young Achievers in Scalable Computing award in 2014, and the LLNL Early and Mid-Career Recognition award in 2018.

This talk is organized by Brandi Adams