PhD Defense: Principled Frameworks for AI Alignment: From Post-Training to Inference
Souradip Chakraborty
Wednesday, April 15, 2026, 9:00-11:00 am
Abstract

Artificial intelligence (AI) is increasingly being deployed in high-stakes settings such as healthcare, robotics, defense, and law. As these systems become more capable and autonomous, it becomes essential to ensure that their behavior is aligned with human preferences. This challenge has made AI alignment a central problem in modern AI.

AI alignment can be broadly achieved through two fundamentally different paradigms: (i) post-training alignment, where model parameters are updated after pretraining to better reflect desired behaviors; and (ii) inference-time alignment, where model behavior is steered at test time without modifying model parameters. While both paradigms aim to improve alignment, they present distinct challenges and opportunities.
This thesis advances AI alignment across both of these directions and is organized into two main parts:

Part I: Post-training AI Alignment:
This part focuses on improving alignment through parameter updates. In particular, it addresses fundamental challenges in online alignment with human feedback, as well as the limitations of existing formulations in capturing diverse and conflicting human preferences.

I. Distributional Mismatch in Online Alignment: We first identify a key limitation of RLHF, namely its inability to capture the entanglement between reward learning and policy optimization, which leads to distribution shift and suboptimal alignment. We propose a novel bilevel alignment framework that explicitly models this interdependence, enabling more stable and theoretically grounded learning.

II. Pluralistic Alignment with Diverse Preferences: We then study the problem of pluralistic alignment, showing that single-utility RLHF is fundamentally insufficient to represent diverse and conflicting preferences. To address this, we introduce MaxMin RLHF, inspired by principles from social choice theory, which ensures more equitable alignment across users.

Together, these contributions provide a principled foundation for robust and inclusive post-training alignment.
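As an illustration of the MaxMin idea, an egalitarian alignment objective can be sketched as follows (the notation below is illustrative and not taken verbatim from the thesis):

```latex
\max_{\pi} \; \min_{u \in \mathcal{U}} \;
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}
\big[ r_u(x, y) \big]
\;-\; \beta \, \mathrm{KL}\big( \pi \,\|\, \pi_{\mathrm{ref}} \big)
```

Here \(\mathcal{U}\) indexes user subpopulations, \(r_u\) is the reward model learned for group \(u\), and \(\pi_{\mathrm{ref}}\) is the reference policy. Maximizing the inner minimum ensures the worst-served group's expected reward is raised, rather than optimizing a single averaged utility.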


Part II: Inference-time AI Alignment:
In contrast to post-training methods, inference-time alignment enables flexible and efficient adaptation by directly steering the generation process at test time without updating model parameters, allowing for real-time personalization at low cost. This part develops a unified framework for controlling model behavior during decoding to achieve both efficiency and robustness. We first introduce Transfer Q*, a principled controlled decoding algorithm that leverages aligned base models to estimate optimal value functions for new tasks, enabling provably efficient and high-quality alignment. Building on this, we propose IMMUNE, which incorporates safety constraints directly into the decoding process to defend against jailbreak and adversarial prompts while preserving user intent. We further extend this paradigm to a multi-agent setting, where a mixture of specialized agents is coordinated via an implicit Q-function to enable adaptive policy switching and improved performance across diverse tasks. Finally, we move beyond standard instruction-tuned models to large reasoning models (LRMs) and investigate efficient test-time scaling strategies for improving their performance.
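The decoding-time steering described above can be sketched in the standard value-guided form, where the base model's next-token distribution is reweighted by an estimated action-value function. The function and variable names below are hypothetical stand-ins, not the thesis's actual implementation:

```python
import math


def controlled_decode_step(base_logprobs, q_values, alpha=1.0):
    """One step of value-guided decoding (a sketch, not the exact
    Transfer Q* algorithm): tilt the base policy's next-token
    distribution by exp(alpha * Q) and renormalize.

    base_logprobs: dict mapping candidate tokens to the base model's
        log-probabilities (hypothetical model outputs).
    q_values: dict mapping the same tokens to estimated action values.
    alpha: strength of value guidance; alpha=0 recovers the base model.
    """
    # Combine base log-probability with the scaled value estimate.
    scores = {t: lp + alpha * q_values[t] for t, lp in base_logprobs.items()}
    # Softmax with max-subtraction for numerical stability.
    m = max(scores.values())
    exps = {t: math.exp(s - m) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}
```

With `alpha = 0` the base distribution is returned unchanged, while larger `alpha` shifts probability mass toward tokens the value function prefers, all without touching model parameters.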

Together, these two parts provide a unified view of AI alignment: one direction improves alignment by modifying model parameters, while the other improves alignment by controlling how models are used at inference time. Across both settings, this thesis develops principled algorithms, theoretical insights, and practical methods for building AI systems that are more robust, adaptive, safe, and aligned with human goals.

Bio

Souradip is a fifth-year PhD student in the Department of Computer Science at the University of Maryland, College Park, advised by Prof. Furong Huang and Prof. Dinesh Manocha. His research focuses on developing principled and scalable algorithms for aligning AI agents in adaptive environments, with the goal of making them safe, robust, and aligned with human behavior and preferences, thereby bridging the gap between theory and practice.
Souradip received the Outstanding Paper Award at AdvML-Frontiers (NeurIPS 2024) and TSRML (NeurIPS 2022), along with Outstanding Reviewer Awards at NeurIPS 2022, NeurIPS 2023, and AISTATS 2023. During the PhD program, he has published in venues including ICML, NeurIPS, ICLR, AAAI, CoRL, and ICRA.

 

Examining Committee Chair: Dr. Furong Huang

Dean's Representative: Dr. Nikhil Chopra

Committee Co-Chair: Dr. Dinesh Manocha

Members:

Dr. Bahar Asgari

Dr. Pratap Tokekar

Dr. Amrit Singh Bedi

This talk is organized by Migo Gui