Significant advances in computer architecture (the widespread adoption of accelerators such as GPGPUs) and parallel computing (scalable libraries for dense and sparse linear algebra) have contributed to the ongoing AI revolution. In particular, distributed training of large language models (LLMs) relies on scalable matrix multiplication algorithms and efficient communication over high-speed interconnects. Pre-training and fine-tuning LLMs with hundreds of billions to trillions of parameters, and graph neural networks (GNNs) on extremely large graphs, require hundreds to tens of thousands of GPUs. However, such training often suffers from significant scaling bottlenecks such as high communication overheads and load imbalance.
Abhinav Bhatele is an associate professor in the Department of Computer Science and director of the Parallel Software and Systems Group at the University of Maryland, College Park. His research interests are broadly in systems and AI, with a focus on parallel computing and distributed AI. He has published research on parallel programming models and runtimes, network design and simulation, applications of machine learning to parallel systems, parallel deep learning, and analyzing, visualizing, modeling, and optimizing the performance of parallel software and systems. Abhinav has received best paper awards at Euro-Par 2009, IPDPS 2013, IPDPS 2016, and PDP 2024, and a best poster award at SC 2023. He was selected as a recipient of the IEEE TCSC Award for Excellence in Scalable Computing (Early Career) in 2014, the LLNL Early and Mid-Career Recognition Award in 2018, the NSF CAREER award in 2021, the IEEE TCSC Award for Excellence in Scalable Computing (Middle Career) in 2023, and the UIUC CS Early Career Academic Achievement Alumni Award in 2024.

