Progress in question answering (QA) is typically measured by how often models give correct answers, but the advent of Large Language Models (LLMs) has shifted focus to making QA responses more helpful: responses must not only contain correct answers, but also reasoning chains with extra context that help users reach their goals. These goals vary by task, such as learning concepts, guiding problem-solving, or addressing tailored needs. Two key steps for making LLMs helpful in QA are: 1) training on preferences reflecting what most users find helpful; and 2) evaluating the correctness of LLM answers. This proposal argues that both steps are poor proxies for true helpfulness. We first show that answer-correctness evaluations mask reasoning errors that limit helpfulness, then use these insights to refine preference training, building QA systems that not only give correct answers, but also support user goals.
Part I exposes how standard QA evaluations, which test whether models correctly match predefined answers, fail to catch reasoning errors that degrade helpfulness. We show that while LLMs excel at returning correct answers in QA, they are much weaker at Process of Elimination (PoE), a logically inverse task. LLMs struggle to reason about why answers are incorrect in PoE, showing that these models fail to give adaptable reasoning chains. We then contrast standard QA, where models provide an answer for an input question, with a reverse QA (RQA) task, where models produce a question for a given answer. LLMs struggle with RQA and often generate questions that look complex but are in fact faulty, showing that LLMs can produce responses that look helpful without truly being helpful.
Part II uses these insights to refine preference training, where LLMs learn the responses most users think are helpful, resulting in more helpful QA systems in user-facing applications. We first develop a QA system that generates mnemonics: study aids that help users learn the answers to vocabulary queries. Informed by RQA, we deploy our model's mnemonics in a learning app and find that training on what users think best helps them may not lead to models that aid user goals (e.g., learning). We show how combining these signals (what looks helpful and what is truly helpful) can improve the overall quality of study aids from our QA system.
Finally, we propose two threads to further refine preference training. First, we propose a preference scheme that distinguishes plans that merely seem helpful from those that truly help humans solve complex questions. Next, to fix LLMs' adaptable reasoning errors in PoE, we propose a training method that learns why users prefer responses, not just which response is preferred, aiding personalization. By advancing step-by-step and personalized reasoning, we aim to build QA systems that are not just correct, but also helpful.
Nishant Balepur is a Ph.D. student in Computer Science at the University of Maryland, advised by Professors Jordan Boyd-Graber and Rachel Rudinger. He conducts research in Natural Language Processing, specifically on making Large Language Models (LLMs) more helpful for users. His research has been recognized with a National Science Foundation Graduate Research Fellowship (NSF GRFP), a Cohere for AI Research Grant, and Best Paper Awards at MASC-SLL 2024 and 2025. His most recent work focuses on refining benchmark evaluations and LLM alignment with human preferences.