Train, Reason, and Act Under Value Function Guidance
Wednesday, April 29, 2026, 11:00 am-12:00 pm
Abstract
Large language models (LLMs) have achieved remarkable progress, enabling them to tackle increasingly complex reasoning and decision-making tasks. A key driver of this success is reinforcement learning (RL) post-training, which aligns model behavior with human preferences. However, this improvement comes at a steep computational cost, as each chain-of-thought (CoT) generation demands substantial inference and training compute. To address this challenge, our recent work shows that value-function guidance—using learned value models as steering signals—can drive both more efficient training and more effective reasoning in LLMs.
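For context, a standard result from KL-regularized RL (stated generally here; the talk's precise formulation may differ) explains why a value function can serve as a steering signal: with reference policy $\pi_{\mathrm{ref}}$ and regularization strength $\beta$, the policy maximizing $\mathbb{E}_\pi[r] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ is $\pi^*(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s)\exp\big(Q^*(s,a)/\beta\big)$, so a learned $Q^*$ (or value estimate) can directly reweight the base model's next-token distribution at generation time.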

This talk presents three complementary methods built around this principle. A*-PO accelerates policy optimization by precomputing optimal value estimates offline, cutting rollout and memory costs. Q♯ extends this idea through distributional RL, learning a regularized optimal Q-function that improves reasoning accuracy and provides convergence guarantees under KL-regularized objectives. Finally, Value-Guided Search (VGS) applies value-based reasoning at inference time, using token-level value models to guide generation instead of relying on costly process reward models or majority voting. Together, A*-PO, Q♯, and VGS demonstrate how value-function guidance can unify post-training and test-time reasoning, making LLM alignment faster, more stable, and significantly more compute-efficient.
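To make the inference-time idea concrete, here is a minimal Python sketch of value-guided decoding in the spirit of VGS. All names (value_guided_decode, sample_blocks, value_model) are hypothetical placeholders, not the speaker's implementation: a sampler proposes k candidate continuation blocks, a learned value model scores each partial sequence, and the highest-valued candidate is kept.

    # Minimal sketch of value-guided blockwise decoding (hypothetical helpers).
    def value_guided_decode(prompt, sample_blocks, value_model, k=4, max_blocks=32):
        """Greedy blockwise search steered by a learned value model.

        sample_blocks(text, k): returns k sampled continuation strings.
        value_model(text): returns a scalar estimate of expected final reward.
        """
        text = prompt
        for _ in range(max_blocks):
            candidates = sample_blocks(text, k)
            # Keep the continuation whose resulting prefix the value model rates highest.
            text = max((text + c for c in candidates), key=value_model)
            if text.endswith("</s>"):  # hypothetical end-of-sequence marker
                break
        return text

Because the value model scores partial sequences, weak branches can be pruned early rather than only after full generations are complete, which is where the savings over process-reward rescoring or majority voting would come from.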
Bio
Kianté Brantley is an Assistant Professor in the Kempner Institute and the School of Engineering and Applied Sciences (SEAS) at Harvard University. He completed his Ph.D. in Computer Science at the University of Maryland, College Park, advised by Dr. Hal Daumé III, and went on to postdoctoral studies at Cornell University, working with Thorsten Joachims. His research focuses on problems at the intersection of machine learning and interactive decision-making, with the goal of improving the decision-making capabilities of foundation models. He has received several awards, including the ACM SIGHPC Computational and Data Science Fellowship, the Microsoft Dissertation Research Grant, and the NSF CIFellow Postdoctoral Fellowship.
This talk is organized by Wei Ai.