Talks

Challenges in AI-assisted AI evaluation

4105 Brendan Iribe Center for Computer Science and Engineering (IRB)

Wednesday, October 9, 2024, 11:00 am-12:00 pm

Abstract

As AIs are deployed to solve increasingly complex problems, reliable human oversight becomes a major challenge, because AIs are also getting better at producing outputs that look correct to humans but are in fact subtly flawed. To support effective oversight, approaches like debate, constitutional AI, and reward modeling all involve using AIs to assist human evaluators. Although promising, these approaches can create new risks, since AIs are being used to evaluate themselves. In this talk, I will discuss three failure modes in both AI-assisted evaluation and training with flawed supervision. I will also discuss preliminary work on mitigating these risks.

Bio

Shi is an assistant professor at the George Washington University. He received his PhD from UMD, supervised by Jordan Boyd-Graber. He did postdocs at UChicago with Chenhao Tan and at NYU in the Alignment Research Group with Sam Bowman and He He. Shi works on AI safety, in particular scalable oversight, as an extension of his work on human-AI collaboration, interpretability evaluation, and adversarial robustness. His most recent work focuses on a meta-evaluation of risks in scalable oversight methods and evaluations.

This talk is organized by Naomi Feldman