How does gradient descent work?
Jeremy Cohen
Remote
Friday, October 10, 2025, 12:00-1:30 pm
Abstract

Join Zoom Meeting: https://umd.zoom.us/j/6615193287?pwd=VC9jZ0EyVmtPK0xuVU9pUEpGVG5EZz09
Meeting ID: 661 519 3287
Passcode: yeyX37

Optimization is the engine of deep learning, yet the theory of optimization has had little impact on the practice of deep learning. Why? In this talk, we will first show that traditional theories of optimization cannot explain the convergence of the simplest optimization algorithm — deterministic gradient descent — in deep learning. Whereas traditional theories assert that gradient descent converges because the curvature of the loss landscape is “a priori” small, we will explain how in reality, gradient descent converges because it *dynamically avoids* high-curvature regions of the loss landscape. Understanding this behavior requires Taylor expanding to third order, which is one order higher than normally used in optimization theory. While the “fine-grained” dynamics of gradient descent involve chaotic oscillations that are difficult to analyze, we will demonstrate that the “time-averaged” dynamics are, fortunately, much more tractable. We will present an analysis of these time-averaged dynamics that yields highly accurate quantitative predictions in a variety of deep learning settings. Since gradient descent is the simplest optimization algorithm, we hope this analysis can help point the way towards a mathematical theory of optimization in deep learning.
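The curvature threshold at work here is the standard one: for gradient descent with step size eta on a quadratic with curvature lam, the iterates contract exactly when lam < 2/eta, and oscillate divergently when lam exceeds that threshold. The short Python sketch below (not taken from the talk; the quadratic example and the 2/eta threshold are textbook facts) makes the threshold concrete:

eta = 0.1  # step size; the curvature stability threshold is 2 / eta = 20
for lam in (5.0, 19.0, 21.0):  # curvatures below, just below, and above threshold
    x = 1.0
    for _ in range(50):
        # Gradient step on f(x) = 0.5 * lam * x**2, so f'(x) = lam * x:
        #   x <- (1 - eta * lam) * x, which shrinks iff |1 - eta*lam| < 1,
        #   i.e. iff lam < 2 / eta.
        x -= eta * lam * x
    verdict = "converges" if abs(x) < 1.0 else "diverges"
    print(f"curvature {lam:5.1f}: |x_50| = {abs(x):.3e} -> {verdict}")

The abstract's point is that in deep learning the landscape's curvature is not small a priori; instead, gradient descent dynamically steers the curvature it encounters to stay near this 2/eta threshold, a self-regulating effect that only becomes visible once the loss is Taylor expanded to third order.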

Bio

Jeremy Cohen is a research fellow at the Flatiron Institute.  He has recently been working on understanding optimization in deep learning.  He obtained his PhD in 2024 from Carnegie Mellon University, advised by Zico Kolter and Ameet Talwalkar.

This talk is organized by Migo Gui.