Communication-Efficient Heterogeneity-Aware Machine Learning System and Architecture
Friday, December 6, 2019, 10:00-11:00 am

Much of the success of deep learning comes from increasingly large models that achieve high accuracy. At the same time, training such complex models on large data sets is difficult. It is therefore crucial to accelerate training with distributed systems and architectures, where communication and heterogeneity are two key challenges. In this talk, I will present two heterogeneity-aware decentralized training protocols that avoid a communication bottleneck. Specifically, Hop supports an arbitrary iteration gap between workers through novel queue-based synchronization, tolerating heterogeneity with system techniques. Prague uses randomized communication to tolerate heterogeneity, with a new training algorithm based on partial reduce, an efficient communication primitive. Moreover, I will present systematic tensor partitioning for training on heterogeneous accelerator arrays (e.g., GPU/TPU). We believe that our principled approaches are crucial for achieving high-performance and efficient distributed training.


Xuehai Qian is an assistant professor at the University of Southern California. His research interests include domain-specific systems and architectures, performance tuning and resource management of cloud systems, and parallel computer architectures. He received his Ph.D. from the University of Illinois at Urbana-Champaign and was a postdoc at UC Berkeley. He is the recipient of the W.J. Poppelbaum Memorial Award at UIUC, the NSF CRII and CAREER Awards, and the inaugural ACSIC (American Chinese Scholar In Computing) Rising Star Award.

This talk is organized by Sharron McElroy