Statistical Challenges in Modern Machine Learning and their Algorithmic Consequences
Yeshwanth Cherapanamjeri
IRB 4105 or https://umd.zoom.us/j/94340703410?pwd=rrXaGSXSpabcMTtDNmeCNf2Ih2fQYE.1
Wednesday, February 19, 2025, 11:00 am-12:00 pm
Abstract

The success of modern machine learning is driven, in part, by the availability of large-scale datasets. However, their immense scale also makes effective curation of such datasets challenging. Many classical estimators, developed under the assumption of clean, well-behaved data, fare poorly when deployed in these settings. This raises both statistical and algorithmic challenges: what are the statistical limits of estimation in these settings, and can they be achieved by computationally efficient algorithms?

In this talk, I will compare and contrast the task of addressing these challenges in two natural, complementary settings: the first featuring extreme noise and the second, extreme bias.

In the first setting, we consider the problem of estimation with heavy-tailed data, where recent work has produced estimators achieving optimal statistical performance. However, these solutions are computationally impractical, and their analysis is tailored to the specific problem of interest. I will present a simple algorithmic framework that has resulted in state-of-the-art estimators for a broad class of heavy-tailed estimation problems.
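
For intuition about the statistical phenomenon at play (the abstract does not specify the talk's framework, so this is not the speaker's estimator), below is a minimal Python sketch of the classical median-of-means estimator, a standard baseline for mean estimation with heavy-tailed data; all names and parameter values are illustrative.

    import numpy as np

    def median_of_means(samples, num_blocks=10):
        # Split the data into blocks, average each block, and return the
        # median of the block means. Unlike the sample mean, this estimator
        # enjoys sub-Gaussian-style deviation bounds assuming only a finite
        # variance.
        rng = np.random.default_rng(0)
        shuffled = rng.permutation(np.asarray(samples, dtype=float))
        block_means = [b.mean() for b in np.array_split(shuffled, num_blocks)]
        return np.median(block_means)

    # Illustrative data: shifted Pareto samples with tail index 2.1, so the
    # variance is finite but moments of order above 2.1 are not.
    data = np.random.default_rng(1).pareto(2.1, size=10_000) + 1.0
    print(median_of_means(data))  # close to the true mean 2.1/1.1 ≈ 1.91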

Next, I will consider the complementary setting of extreme bias under the classical Roy model of self-selection, where bias arises from the strategic behavior of the data-generating agents. I will describe algorithmic approaches to counteract this bias, yielding the first statistically and computationally efficient estimators in this setting.
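
To see how self-selection biases naive estimates (the Roy model predates the talk, and the estimators themselves are not in the abstract; the parameter values below are illustrative), here is a short Python simulation: each agent draws potential outcomes in two sectors, chooses the better one, and only the chosen outcome is observed.

    import numpy as np

    # Roy-model self-selection: each agent has potential outcomes in two
    # sectors, and we observe only the outcome in the sector they choose.
    rng = np.random.default_rng(0)
    n = 100_000
    w0 = rng.normal(loc=1.0, scale=1.0, size=n)  # potential outcome, sector 0
    w1 = rng.normal(loc=1.0, scale=1.0, size=n)  # potential outcome, sector 1
    chose_1 = w1 > w0                            # agents pick the better sector
    observed_1 = w1[chose_1]                     # sector-1 outcomes we actually see

    print("true mean of sector-1 outcomes:", w1.mean())          # ≈ 1.00
    print("naive mean over observed data: ", observed_1.mean())  # ≈ 1.56

The naive sector-wise mean is biased upward because high outcomes are over-represented among the agents who selected that sector; consistent estimation must explicitly account for the selection rule.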

Finally, I will conclude the talk with future directions targeting the construction of good datasets when the data is drawn from a diverse and heterogeneous range of sources with varying quality and quantity.

Bio
Yeshwanth is a postdoctoral researcher at MIT, where he is mentored by Constantinos Daskalakis. Previously, he completed his Ph.D. at UC Berkeley under the guidance of Peter Bartlett. Yeshwanth is interested in statistical and algorithmic challenges that arise in the modern practice of machine learning. These include settings with extreme amounts of noise or bias, missing or partially observed data, and, more recently, the impact of dataset construction on statistical performance.
This talk is organized by Samuel Malede Zewdu.