Talks

Architecture-Tailored Parallelization for Accessible Large Model Era

Xupeng Miao

IRB 4105 or https://umd.zoom.us/j/95853135696?pwd=VVEwMVpxeElXeEw0ckVlSWNOMVhXdz09

Monday, April 1, 2024, 1:00-2:00 pm

Abstract

In this talk, I will introduce my work on machine learning (ML) parallelization, a critical endeavor to bridge the significant gap between diverse ML programs and multi-tiered computing architectures. Specifically, I will explore ML parallelization at three distinct yet interconnected levels. First, I will show that by leveraging the unexplored space of model partitioning strategies, distributed ML training can be up to 20x faster than existing systems by improving communication efficiency. I will highlight two distributed ML systems: HET for sparse embedding models and Galvatron for dense Transformer models. Second, I will discuss how to improve GPU utilization through ML parallelization. I will present SpecInfer, a system that reduces large language model (LLM) serving latency by 1.5-3.5x compared to existing systems by leveraging a novel tree-based speculative inference and verification mechanism. Third, I will demonstrate how ML parallelization makes LLMs more widely accessible by extending its reach to inter-cloud environments. I will describe SpotServe, the first LLM serving system on spot instances, which handles preemptions with dynamic reparallelization, ensures relatively low tail latency, and reduces monetary cost by 54%. Finally, I will conclude with a discussion on pushing my research forward toward a holistic and unified infrastructure for democratizing ML.

Bio

Xupeng Miao is currently a postdoctoral researcher at Carnegie Mellon University working with Prof. Zhihao Jia and Prof. Tianqi Chen. Before that, he received his Ph.D. from Peking University, advised by Prof. Bin Cui. He is broadly interested in machine learning systems, data management, and distributed computing. His research has resulted in 30+ publications (including 13 first-authored papers) in top-tier conferences such as OSDI, ASPLOS, SIGMOD, VLDB, NSDI, and NeurIPS. Recently, he has focused on building efficient, scalable, and affordable software systems (e.g., FlexFlow Serve) for large language models. His work was recognized with the 2022 ACM China Doctoral Dissertation Award and the Best Scalable Data Science Paper Award of VLDB 2022.

This talk is organized by Samuel Malede Zewdu.