Thinking Outside the GPU: Systems for Scalable Machine Learning Pipelines

Mark Zhao

IRB 4105 or https://umd.zoom.us/j/94340703410?pwd=rrXaGSXSpabcMTtDNmeCNf2Ih2fQYE.1

Tuesday, February 11, 2025, 11:00 am-12:00 pm

Abstract

Scalable and efficient machine learning (ML) systems have been instrumental in fueling recent advancements in ML capabilities. However, further scaling these systems requires more than simply increasing the performance and quantity of accelerators such as GPUs. Modern ML deployments rely on complex pipelines composed of many diverse and interconnected systems beyond just accelerators.

In this talk, I will emphasize the importance of building scalable systems across the entire ML pipeline. In particular, I will first explore how to build scalable data storage and ingestion systems to manage massive datasets for large-scale ML training pipelines, including those at Meta. To meet growing ML data demands, these data systems must be optimized for performance and efficiency. I will next illustrate how to leverage synergistic optimizations across the training data pipeline to unlock performance and efficiency gains beyond what isolated system optimizations can achieve. However, effectively deploying these optimizations requires navigating a complex system design space. To address this, I will then introduce cedar, a framework that automates these optimizations and orchestrates ML data processing for diverse training workloads. Finally, I will discuss key opportunities in further advancing the scalability, security, and capabilities of the systems that will drive the next generations of ML training and inference pipelines.

Bio

Mark is a final-year Ph.D. candidate at Stanford University, where he is advised by Christos Kozyrakis. His research builds systems for end-to-end machine learning deployments by leveraging tools across the computing stack, including computer systems, computer architecture, security, databases, and machine learning. He has received an IEEE S&P Distinguished Practical Paper Award, a Top Pick in Hardware and Embedded Security Award, and an MLCommons ML and Systems Rising Star Award. His work is generously supported by a Stanford Graduate Fellowship and a Meta Ph.D. Fellowship in AI System SW/HW Co-Design. His website is at https://web.stanford.edu/~myzhao/.

This talk is organized by Samuel Malede Zewdu.