Large Language Models (LLMs) have brought remarkable advancements to the computing industry. However, a high barrier stands between LLMs and the vast majority of researchers and practitioners, raised by the engineering challenges of enormous model sizes and substantial compute requirements. In this talk, I’ll discuss my research on system innovations to democratize LLMs, including (1) Alpa and AlpaServe, the first systems to automate model-parallel training and to accelerate serving with model parallelism, and (2) vLLM, a high-throughput and memory-efficient serving engine for large language models, accelerated with PagedAttention. I will conclude by presenting the short-term research challenges and long-term trends in LLM systems.
Zhuohan Li is a final-year CS PhD student at UC Berkeley, where he is advised by Prof. Ion Stoica. He is interested in designing and building efficient machine learning systems. Recently, he has been focusing on the training and serving of large models, specifically LLMs. His work includes Alpa, AlpaServe, Vicuna, and vLLM (PagedAttention). Most notably, vLLM has become the most popular open-source LLM serving system in the world and has been widely used and deployed across the industry. His work was selected for the first cohort of the a16z open-source AI grant.