A central technique for alignment is Reinforcement Learning from Human Feedback~(RLHF), which trains models by optimizing them against a reward signal derived from human preferences. While effective, this paradigm is susceptible to failure modes where models learn to maximize their reward score without genuinely adhering to the desired principles.
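To make that training signal concrete, below is a minimal sketch of the standard pairwise (Bradley-Terry) objective commonly used to fit such a reward model from human preference pairs; `reward_model` and the batch field names are hypothetical placeholders rather than a specific implementation from this thesis.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, batch):
    # Scalar rewards for the human-preferred ("chosen") and the rejected responses.
    r_chosen = reward_model(batch["chosen_input_ids"], batch["chosen_attention_mask"])
    r_rejected = reward_model(batch["rejected_input_ids"], batch["rejected_attention_mask"])
    # Bradley-Terry objective: maximize the log-probability that the
    # chosen response is ranked above the rejected one.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```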
This thesis summarizes my investigation of critical vulnerabilities in current alignment methodologies, focusing on how models exploit unforeseen loopholes in evaluation and training frameworks. My work first demonstrates that “reward hacking” in RLHF is a pervasive issue extending well beyond simple verbosity. It reveals that prominent reward models, and even human evaluators, exhibit a strong “format bias”: an undue preference for superficial cues such as lists, bolded text, links, and emojis. The study shows that LLMs can easily exploit this bias to achieve high scores on alignment benchmarks, often by merely manipulating their response format rather than improving the substantive quality of the content. This finding highlights a fundamental flaw in how we currently measure and reward “good” behavior in text-based models.
Furthermore, my work extends this inquiry beyond unimodal text generation to the burgeoning field of Omni-modality Language Models (OLMs). To probe the alignment of these more complex systems, we introduce OmnixR, a novel evaluation suite designed to test reasoning across a diverse mix of modalities, including text, audio, images, and video. The evaluation reveals that even state-of-the-art OLMs like GPT-4o and Gemini struggle significantly with tasks that require genuine cross-modal reasoning. These models exhibit unique biases and failure modes when forced to integrate information from multiple sources, indicating that alignment challenges are not only persistent but also evolve in complexity with model capabilities.
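For intuition, the following schematic harness shows the shape of such a mixed-modality evaluation, where each question may arrive through several modalities that the model must integrate; `Example`, `olm.generate`, and the exact-match scoring rule are illustrative placeholders, not the released OmnixR interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Example:
    answer: str
    text: Optional[str] = None
    image: Optional[bytes] = None
    audio: Optional[bytes] = None
    video: Optional[bytes] = None

def accuracy(olm, examples):
    # Exact-match accuracy over a mixed-modality benchmark (illustrative only).
    correct = 0
    for ex in examples:
        parts = [p for p in (ex.text, ex.image, ex.audio, ex.video) if p is not None]
        prediction = olm.generate(parts)  # hypothetical OLM interface
        correct += int(prediction.strip().lower() == ex.answer.strip().lower())
    return correct / len(examples)
```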
To address the vulnerabilities identified in RLHF, this thesis then proposes ODIN, a novel method designed to mitigate reward hacking. ODIN tackles the problem by training a two-head reward model that explicitly disentangles content quality from exploitable stylistic features, such as response length. By training one head to correlate with these features and another to be decorrelated from them, we can isolate a purer signal for content quality. During the reinforcement learning phase, the policy is optimized using only the decorrelated, quality-focused reward signal. Our experiments demonstrate that this approach effectively prevents the model from hacking the reward system through verbosity and other stylistic artifacts, resulting in better-aligned models that achieve high performance without resorting to superficial tricks.
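The sketch below illustrates the disentangling idea behind such a two-head reward model; it is a simplified reading rather than the exact ODIN training recipe, and the backbone, head names, and correlation penalty are placeholders.

```python
import torch
import torch.nn as nn

class TwoHeadRewardModel(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                       # any encoder returning pooled features
        self.quality_head = nn.Linear(hidden_size, 1)  # content-quality reward (used for RL)
        self.length_head = nn.Linear(hidden_size, 1)   # absorbs length/style signal

    def forward(self, input_ids, attention_mask):
        h = self.backbone(input_ids, attention_mask)   # [batch, hidden_size]
        return self.quality_head(h).squeeze(-1), self.length_head(h).squeeze(-1)

def pearson(x, y, eps=1e-8):
    x, y = x - x.mean(), y - y.mean()
    return (x * y).mean() / (x.std() * y.std() + eps)

def disentangle_loss(r_quality, r_length, lengths, lam=1.0):
    lengths = lengths.float()
    # Push correlation with response length into the length head ...
    corr_length = pearson(r_length, lengths)
    # ... and penalize any length correlation left in the quality head.
    corr_quality = pearson(r_quality, lengths)
    return lam * (corr_quality.abs() - corr_length)
```

In a full training loop, this penalty would be combined with a pairwise preference loss over the head outputs (as in the earlier sketch); at RL time only `r_quality` would serve as the reward, matching the decorrelated, quality-focused signal described above.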
Looking ahead, future work should focus on creating more challenging and dynamic benchmarks that co-evolve with model capabilities, preventing benchmark overfitting and paving the way for more reliable AI systems.
Lichang Chen is a PhD candidate at the University of Maryland, College Park. His research interests lie in AI alignment and reasoning, especially in building more advanced AI that can align with human intent and learn new tasks as quickly as a generalist human can. He has published over 15 papers at top venues, e.g., ICML, ICLR, ACL, EMNLP, and NAACL, and his work has accumulated over 2,000 citations during his PhD studies. Lichang has also interned with the multimodal team at Google Research and the GenAI/Science unit at Google DeepMind.