Deep learning has made significant advances across many fields, driven by increasingly large neural networks and massive datasets. However, these improvements come at the cost of high computational demands, necessitating thousands of GPUs operating in parallel for extreme-scale model training. At such scales, the overheads associated with inter-GPU communication become a major bottleneck, severely limiting efficient hardware utilization.
This thesis addresses the challenge of optimizing communication in large-scale parallel deep learning. First, it introduces a novel four-dimensional hybrid parallel algorithm designed to minimize communication overhead while remaining easy for practitioners to use. Second, it presents a topology-aware communication model that identifies optimal configurations for this algorithm based on the hardware architecture, improving efficiency and scalability. Finally, the thesis develops highly scalable implementations of collective communication primitives commonly used in distributed deep learning, further enhancing performance. Together, these optimizations enable us to efficiently scale LLM training to more than 16,000 GPUs on the Frontier supercomputer, achieving a throughput of nearly 1.4 exaFLOP/s.
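To give a rough sense of what a topology-aware configuration search might look like, the sketch below enumerates ways to factor a GPU count into a four-dimensional decomposition (one data-parallel dimension and three tensor-parallel dimensions) and ranks them with a toy cost function. This is an illustrative assumption, not the thesis's actual model or implementation; the function names, the intra-node group size, and the cost weights are all hypothetical.

```python
# Minimal sketch of a topology-aware configuration search (illustrative only;
# not the model or code described in the thesis).
from itertools import product

def factorizations(n_gpus, dims=4):
    """Yield every tuple (data, tx, ty, tz) whose product equals n_gpus."""
    divisors = [d for d in range(1, n_gpus + 1) if n_gpus % d == 0]
    for combo in product(divisors, repeat=dims):
        prod = 1
        for c in combo:
            prod *= c
        if prod == n_gpus:
            yield combo

def toy_comm_cost(config, intra_node=8):
    """Crude proxy cost: tensor-parallel groups that fit within a node are
    cheap; groups that span nodes pay a large penalty for slower links."""
    data, tx, ty, tz = config
    cost = 0.0
    for group in (tx, ty, tz):
        cost += group if group <= intra_node else 10.0 * group
    cost += 0.1 * data  # data-parallel all-reduces are amortized, weight lightly
    return cost

best = min(factorizations(64), key=toy_comm_cost)
print("lowest-cost 4D configuration (data, tx, ty, tz):", best)
```

In practice, a real model of this kind would account for measured link bandwidths, message sizes, and the collective algorithms in use rather than a fixed penalty, but the overall structure of the search is similar: enumerate valid decompositions and pick the one the cost model predicts to be cheapest.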
Siddharth Singh is a fifth-year Ph.D. candidate in Computer Science at the University of Maryland, College Park. He earned his B.Tech and M.Tech in Computer Science and Engineering from the Indian Institute of Technology, Kharagpur. He is advised by Prof. Abhinav Bhatele, and his research focuses on the practical aspects of distributed training and inference for large neural networks. In the 2023-24 academic year, he received the Outstanding Graduate Research Assistant Award and led a team to the finals of the ACM Gordon Bell Prize.