The past decade has seen exponential growth in the size and complexity of cloud computing systems. A key design choice during this expansion has been the widespread reliance on heuristics—fast, empirically effective methods that scale well. However, heuristics can harbor unknown corner cases in which their performance becomes arbitrarily bad. This unpredictability increases the risk of unavailability and performance loss, especially at scale, where extreme cases are more likely to occur.
In this talk, I will discuss two research directions that limit these risks and enhance performance and availability at scale. In the first part, I will focus on developing scalable algorithms with formal guarantees, which eliminate the need for heuristics and their associated risks. However, such methods are not always attainable, and we must sometimes rely on heuristics to scale. In the second part, I will introduce general methods to quantify the risks of using heuristics and to deploy workarounds.
I will conclude by highlighting new research directions that my work opens up, ranging from the performance analysis of learning-enabled systems to online heuristic analysis and heuristic synthesis using foundation models.
Pooria Namyar is a Ph.D. candidate at the University of Southern California, working with Prof. Ramesh Govindan. His work combines theory and systems to improve the performance and availability of large-scale networks and systems. Pooria received the Google Ph.D. Fellowship in Networking (2024) and was recognized as a Rising Star in Machine Learning and Systems (2024) and an MHI Scholar (2023).
His research has had direct industry impact: his work on max-min fair resource allocation, Soroush, has been deployed in Microsoft’s traffic engineering pipeline; Firefly, his work on clock synchronization, is set to be deployed at Google soon; and his heuristic analysis tools have uncovered and addressed inefficiencies in production heuristics.