PhD Preliminary Proposal: Communication-efficient Hybrid Parallel Algorithms for Neural Network Training
Siddharth Singh
IRB 5165
Abstract
The trend toward larger neural networks for improved generalization in deep learning has led to significant computational challenges, necessitating parallel training across multiple GPUs. However, communication overheads are a major bottleneck to scalability. In this thesis, I propose to address these challenges by developing AxoNN, a highly scalable parallel framework for training large neural networks. I propose a five-dimensional hybrid parallel algorithm designed to minimize communication costs while remaining easy to use. I also plan to develop a communication/performance model that guides users toward configurations with minimal communication volume. The implementation of AxoNN will focus on maximizing the overlap between computation and communication, thereby reducing GPU idle time. Additionally, I plan to develop a user-friendly version of the framework that greatly simplifies the task of parallelizing neural network training for practitioners. By striking a balance between usability and efficiency, AxoNN promises to advance parallel deep learning for large-scale neural networks.
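To give a flavor of the computation/communication overlap mentioned above, the following minimal Python (PyTorch) sketch illustrates the general technique: an asynchronous gradient all-reduce is launched and independent computation proceeds while the collective is in flight. This is not AxoNN's actual code; the function and tensor names are hypothetical, and it assumes a torch.distributed process group (e.g., NCCL) has already been initialized with one GPU per rank.

import torch
import torch.distributed as dist

def overlapped_step(grad_bucket: torch.Tensor,
                    next_input: torch.Tensor,
                    next_weight: torch.Tensor) -> torch.Tensor:
    # Launch the collective without blocking; with NCCL it runs on a
    # separate CUDA stream, so the GPU can keep computing while the
    # gradients are being exchanged.
    work = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)

    # Independent computation (e.g., the next layer's work) proceeds
    # while the all-reduce is in flight, hiding communication time.
    out = next_input @ next_weight

    # Block only at the point where the reduced gradients are needed.
    work.wait()
    return out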
Bio
Siddharth Singh is a fourth-year Ph.D. student in Computer Science at the University of Maryland, College Park. He completed his B.Tech./M.Tech. in Computer Science and Engineering at the Indian Institute of Technology, Kharagpur, and is advised by Prof. Abhinav Bhatele. His research focuses on the practical aspects of distributed training and inference for large neural networks. He received the Outstanding Graduate Assistant Award for AY 2023-24. Through research internships with the DeepSpeed team at Microsoft and the Megatron-LM team at NVIDIA, he has gained hands-on experience in parallel deep learning. He plans to graduate in 2025 and will be looking for research scientist positions in industry.
This talk is organized by Migo Gui.