Empowering Large Language Models with Efficient and Automated Systems
Zhuohan Li
IRB 4105 or https://umd.zoom.us/j/95853135696?pwd=VVEwMVpxeElXeEw0ckVlSWNOMVhXdz09
Thursday, March 28, 2024, 1:00-2:00 pm
Abstract

Large Language Models (LLMs) have brought remarkable advancements to the computing industry. However, a high barrier separates LLMs from the vast majority of researchers and practitioners, created by the engineering challenges of enormous model sizes and substantial compute requirements. In this talk, I'll discuss my research on system innovations to democratize LLMs, including (1) Alpa, the first system to automate model-parallel training, and AlpaServe, which accelerates serving with model parallelism, and (2) vLLM, a high-throughput, memory-efficient serving engine for large language models built on PagedAttention. I will conclude by presenting short-term research challenges and long-term trends in LLM systems.
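
As background for the vLLM portion of the talk, here is a minimal sketch of offline batched inference with vLLM's open-source Python API; the model name and sampling settings below are illustrative choices, not ones taken from the talk.

    # Minimal sketch: offline batched inference with vLLM's Python API.
    # Model name and sampling parameters are illustrative.
    from vllm import LLM, SamplingParams

    prompts = [
        "The capital of France is",
        "Large language models are",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # The engine manages KV-cache memory in fixed-size blocks via PagedAttention,
    # which is what enables the high-throughput batching mentioned above.
    llm = LLM(model="facebook/opt-125m")

    outputs = llm.generate(prompts, sampling_params)
    for out in outputs:
        print(out.prompt, "->", out.outputs[0].text)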

Bio

Zhuohan Li is a final-year CS PhD student at UC Berkeley, advised by Prof. Ion Stoica. He is interested in designing and building efficient machine learning systems, and has recently focused on training and serving large models, specifically LLMs. His work includes Alpa, AlpaServe, Vicuna, and vLLM (PagedAttention). Most notably, vLLM has been the most popular open-source LLM serving system in the world and is widely used and deployed across industry. His work was selected for the first cohort of the a16z open-source AI grant.


This talk is organized by Samuel Malede Zewdu.