PhD Defense: Designing and Evaluating Capable, Safe, and Trustworthy Generative AI Systems
Neel Jain
IRB-4105 https://umd.zoom.us/j/4106837750?pwd=ZEZxbG9UT2dHMTNhK0MxbDQwbGwvZz09&omn=98770465780&jst=2
Monday, April 13, 2026, 4:00-5:30 pm
Abstract

Generative artificial intelligence (GenAI) systems have advanced rapidly throughout the 2020s, reshaping how people interact with technology in everyday life. As these systems are deployed more widely, it has become increasingly important to ensure that they are not only capable but also safe and trustworthy. Meeting this challenge requires progress across several dimensions, including robustness to adversarial misuse, trustworthy refusal behavior in non-adversarial settings, methods for evaluating models beyond conventional labeled benchmarks, and techniques for improving model capability to "chat" under limited supervision. We investigate these challenges through a study of methods for designing and evaluating generative AI systems.

The first part examines the design of safe AI systems in adversarial settings. We study discrete optimization methods for prompt-based attacks, with a focus on generating effective text inputs that expose model vulnerabilities. In this context, we introduce PEZ, an optimizer for generating adversarial prompts that also supports prompt optimization for broader applications such as image reconstruction and task specification. Building on this formulation, we present a series of defenses against gradient-based prompt optimizers and discuss principled approaches for measuring attack success.
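
The core move in PEZ-style hard prompt optimization — project a continuous prompt onto the nearest vocabulary embeddings, evaluate the gradient at that projected (discrete) point, and apply the gradient to the continuous prompt — can be sketched numerically. The toy quadratic loss, the tiny embedding table, and the function names below are illustrative stand-ins, not the dissertation's actual objective or implementation:

```python
import numpy as np

def pez_step(soft_prompt, vocab_emb, grad_fn, lr=0.1):
    """One PEZ-style update (sketch): project each continuous prompt
    embedding to its nearest vocabulary embedding, take the gradient
    at the projected point, and update the continuous prompt."""
    # Pairwise distances: (prompt_len, vocab_size)
    d = np.linalg.norm(soft_prompt[:, None, :] - vocab_emb[None, :, :], axis=-1)
    ids = d.argmin(axis=1)            # nearest token per position
    projected = vocab_emb[ids]        # discrete (hard) prompt embeddings
    # Gradient is evaluated at the *projected* point, applied to the soft prompt.
    return soft_prompt - lr * grad_fn(projected), ids

# Toy demo: a quadratic loss pulling the prompt toward one vocab entry
# (a stand-in for a real model loss over a tokenizer's embedding matrix).
vocab_emb = np.array([[0.0], [1.0], [5.0]])
target = vocab_emb[2]
grad_fn = lambda e: 2.0 * (e - target)   # gradient of ||e - target||^2
prompt = np.array([[0.0]])
for _ in range(10):
    prompt, ids = pez_step(prompt, vocab_emb, grad_fn, lr=0.1)
```

In this 1-D toy the prompt walks from token 0 through token 1 and settles once its nearest neighbor is the target token, at which point the gradient at the projected point vanishes.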

The second part turns to non-adversarial settings, focusing on the problem of controlling refusal behavior in language models. Because appropriate refusal behavior depends on user preferences, context, and demographic factors such as age, the ability to calibrate it is an important aspect of trustworthy AI: systems are more trustworthy when their boundaries can be adjusted transparently and predictably to suit different users and deployment contexts. To this end, we introduce meta tokens, referred to as refusal tokens, as a test-time mechanism for controlling specific categories of refusals, and investigate how to construct data that supports this fine-grained control.
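
Schematically, the data construction can be as simple as prepending a category-specific meta token to each training response, so that at inference time the token can be forced or suppressed to steer refusal behavior per category. The token spellings and field names below are illustrative placeholders, not the dissertation's exact tokens:

```python
def tag_response(prompt, response, refusal_category=None):
    """Prepend a meta token marking whether (and why) a training
    response refuses. Token spellings here are hypothetical
    placeholders for category-level refusal tokens."""
    token = f"[refuse:{refusal_category}]" if refusal_category else "[respond]"
    return f"{token} {response}"

# Building a tagged fine-tuning example:
example = tag_response(
    "How do I pick a lock?",
    "I can't help with that.",
    refusal_category="safety",
)
```

A model fine-tuned on such data learns to emit the meta token first, which gives a single test-time control point: biasing or constraining that first token adjusts refusal rates for one category without retraining.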

The final part considers both evaluation and adaptation in settings where conventional supervision is limited. We first examine model capabilities in low-data regimes, where supervised fine-tuning datasets are often small and imperfect, introducing Noisy Embedding Fine-Tuning (NEFTune), a regularization method that substantially improves response quality and highlights the value of simple interventions for enhancing downstream performance. Finally, we address shortcomings in current benchmarking: labeled benchmarks are often limited in scale and vulnerable to data contamination. To overcome these issues, we explore evaluation on unlabeled data in a self-supervised setting and introduce a framework for self-sensitivity evaluation, inspired by self-supervised learning, that measures the sensitivity and invariance of language models under transformations of the input text.
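
The NEFTune intervention is simple to state: during supervised fine-tuning, add uniform noise to the token embeddings, with magnitude scaled by alpha / sqrt(seq_len * dim) so that longer sequences receive smaller perturbations. A minimal NumPy sketch of the noise step (the hyperparameter name `alpha` follows the method; the function name and shapes are illustrative):

```python
import numpy as np

def neftune_noise(embeddings, alpha=5.0, rng=None):
    """Add scaled uniform noise to token embeddings, NEFTune-style.

    embeddings: (seq_len, dim) array of token embeddings for one sequence.
    alpha: noise scale; per-entry noise is bounded by alpha / sqrt(seq_len * dim).
    """
    rng = rng or np.random.default_rng(0)
    seq_len, dim = embeddings.shape
    scale = alpha / np.sqrt(seq_len * dim)
    noise = rng.uniform(-1.0, 1.0, size=embeddings.shape)
    return embeddings + scale * noise

# Applied once per forward pass during fine-tuning only;
# at inference the embeddings are used unperturbed.
noisy = neftune_noise(np.zeros((8, 16)), alpha=5.0)
```

Because the perturbation acts only on the embedding layer, it slots into an existing training loop without touching the loss or the optimizer.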

Bio

Neel Jain is a fifth-year Computer Science PhD candidate at the University of Maryland, College Park, working on all things LLM-related under the guidance of Professor Tom Goldstein. His work focuses on safe and trustworthy machine learning, with multiple publications at top conferences such as ICLR, NeurIPS, and COLM and over 2k citations. He holds a BA in Mathematics from Williams College and an MS in Computer Science from the University of Maryland. After graduation, Neel will join Apple in the Bay Area.


Examining Committee Chair: Dr. Tom Goldstein

Dean's Representative: Dr. Wojciech Czaja

Members:

Dr. Furong Huang

Dr. Abhinav Bhatele

Dr. Ramani Duraiswami

Dr. Katie Shilton

This talk is organized by Migo Gui