Abstract:
As AI agents are increasingly deployed in high-stakes applications, principled and robust alignment with human preferences becomes essential. This proposal advances the theoretical foundations of AI alignment by addressing three core challenges: (1) Distribution shift in online alignment, (2) Equitable alignment under preference diversity, and (3) Efficient personalization at inference time.
Bilevel RLHF: First, we address a critical issue in online RLHF: existing approaches fail to capture the entanglement between reward learning and policy optimization, which leads to distribution shift and suboptimal alignment. We propose an efficient bilevel optimization framework that models this interdependence and ensures stable alignment, with provable guarantees and improved empirical performance.
MaxMin RLHF: Second, we address the challenge of pluralistic AI alignment by deriving an impossibility result for single-utility RLHF, showing that it cannot adequately represent diverse human preferences. As an equitable alternative, we propose MaxMin RLHF, inspired by the Egalitarian principle in social choice theory.
Transfer Q*: Finally, we address efficient inference-time alignment for real-time personalization without costly fine-tuning. We propose Transfer Q*, a principled controlled decoding algorithm that uses aligned base models to estimate optimal value functions for new rewards, yielding provably efficient, high-quality alignment with effective personalization on real-world tasks.
Together, these contributions provide a principled foundation for building safe, scalable, and fair alignment systems.
Souradip is a fourth-year PhD student in the Department of Computer Science at the University of Maryland, College Park, advised by Prof. Furong Huang and Prof. Dinesh Manocha. His research focuses on developing principled and scalable algorithms for aligning AI agents in adaptive environments, with the goal of making them safe, robust, and aligned with human behavior and preferences, thereby bridging the gap between theory and practice.
Souradip is currently a part-time Student Researcher at Google. He received the Outstanding Paper Award at TSRML (NeurIPS 2022) and Outstanding Reviewer Awards at NeurIPS 2022, NeurIPS 2023, and AISTATS 2023. During his PhD, he has published at venues including ICML, NeurIPS, ICLR, AAAI, CoRL, and ICRA.