Over the past few years, rapid advancements in Artificial Intelligence (AI) have yielded dramatic performance gains across domains ranging from computer vision to natural language understanding. Given the widespread use of AI models in safety-critical applications such as autonomous navigation and medical diagnosis, it is imperative to characterize their vulnerabilities and develop robust mitigation strategies. This thesis investigates the Safety, Robustness, and Reliability of AI through a taxonomy of vulnerabilities comprising three primary dimensions: oversensitivity to input perturbations, undersensitivity to semantic shifts, and structural limitations in generative reliability.
First, we investigate the phenomenon of oversensitivity in deep networks, where minor changes to the input result in disproportionately large and often catastrophic model failures. To address this in the vision domain, we develop Nuclear Curriculum Adversarial Training (NCAT), an efficient single-step training procedure to obtain models that are robust against a union of Lp threat models (L1, L2 and L-infinity). By introducing a curriculum schedule to mitigate catastrophic overfitting, we obtain the first L1 robust model trained via single-step adversaries, with performance comparable to multi-step methods. We further investigate oversensitivity in Large Language Models (LLMs) by introducing a fast beam-search based adversarial attack called BEAST, which can jailbreak standard LLMs in under one GPU-minute.
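To illustrate the general flavor of single-step adversarial training with a curriculum on the perturbation budget, the sketch below shows a generic FGSM-style training loop in PyTorch. It is a simplified illustration under assumed placeholders (model, data loader, epoch count, eps_max), not the actual NCAT procedure, which targets a union of L1, L2, and L-infinity threat models and involves additional machinery not shown here.

```python
# Illustrative sketch only: generic single-step adversarial training with an
# epsilon curriculum. Not the NCAT method from the thesis; all hyperparameters
# below are placeholder assumptions.
import torch
import torch.nn.functional as F

def single_step_adv_train(model, loader, epochs=10, eps_max=8/255, lr=0.1, device="cpu"):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.to(device).train()
    for epoch in range(epochs):
        # Curriculum: grow the perturbation budget over epochs, which helps
        # mitigate catastrophic overfitting in single-step training.
        eps = eps_max * (epoch + 1) / epochs
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            x.requires_grad_(True)
            grad = torch.autograd.grad(F.cross_entropy(model(x), y), x)[0]
            # Single-step Linf adversary (FGSM-style); NCAT instead targets a
            # union of Lp threat models.
            x_adv = (x + eps * grad.sign()).clamp(0, 1).detach()
            opt.zero_grad()
            F.cross_entropy(model(x_adv), y).backward()
            opt.step()
    return model
```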
Second, we characterize the complementary problem of undersensitivity, wherein models maintain a near-uniform level of confidence despite large, perceptually significant changes in the input space. We present a novel Level Set Traversal (LST) algorithm that iteratively uses orthogonal components of the local gradient to identify the “blind spots” of common vision models. We study the geometry of level sets and show that there exist linearly connected paths in input space between images that a human oracle would deem extremely disparate, yet along which vision models retain a near-uniform level of confidence.
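As a rough illustration of traversing a level set, the sketch below moves a source image toward a target image while projecting each step onto the subspace orthogonal to the local gradient of the source-class confidence, so the model's prediction remains approximately unchanged. This is a minimal sketch with assumed hyperparameters (step count, step size) and is not the exact LST algorithm presented in the thesis.

```python
# Illustrative sketch only: gradient-orthogonal traversal toward a target
# image at (approximately) constant model confidence. Hyperparameters are
# placeholder assumptions.
import torch
import torch.nn.functional as F

def level_set_traverse(model, x_src, x_tgt, src_class, steps=1000, step_size=1e-2):
    x = x_src.clone().detach()
    for _ in range(steps):
        x.requires_grad_(True)
        # Log-confidence of the original (source) class for the current image.
        conf = F.log_softmax(model(x), dim=1)[:, src_class].sum()
        g = torch.autograd.grad(conf, x)[0]
        with torch.no_grad():
            d = (x_tgt - x).flatten(1)   # direction toward the target image
            g_flat = g.flatten(1)
            # Remove the component of the step along the confidence gradient,
            # so the step stays (to first order) on the level set.
            coef = (d * g_flat).sum(1, keepdim=True) / (g_flat.norm(dim=1, keepdim=True) ** 2 + 1e-12)
            step = (d - coef * g_flat).view_as(x)
            x = (x + step_size * step).clamp(0, 1)
    return x.detach()
```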
Third, we investigate the detection of hallucinations in LLMs (outputs that are fallacious or fabricated, yet often appear plausible at first glance) using LLM-Check, an effective suite of techniques that rely only upon the internal hidden representations, attention similarity maps, and logit outputs of an LLM. We demonstrate its efficacy across a broad range of settings and diverse datasets: from zero-resource detection to cases where multiple model generations or external databases are available at inference time, and with varying levels of access to the original source LLM.
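As a minimal illustration of white-box, logit-based scoring in this spirit, the sketch below computes the mean negative log-likelihood that a model assigns to its own response tokens; higher values indicate lower internal confidence in the generated text. The model name, prompt handling, and any decision threshold are assumptions, and LLM-Check's hidden-state and attention-map analyses are not reproduced here.

```python
# Illustrative sketch only: a logit-based uncertainty score for a model
# response. LLM-Check also uses hidden-state and attention-map signals that
# are not shown here; the model name is a placeholder assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def logit_uncertainty_score(model_name, prompt, response):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    # Approximation: tokenize prompt and prompt+response separately to locate
    # the response tokens (boundary tokenization may differ slightly).
    full = tok(prompt + response, return_tensors="pt")
    n_prompt = tok(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**full).logits            # (1, seq_len, vocab)
    ids = full["input_ids"][0, n_prompt:]        # response tokens
    # Logits at position i predict token i+1, so shift by one.
    logprobs = torch.log_softmax(logits[0, n_prompt - 1:-1], dim=-1)
    nll = -logprobs.gather(1, ids.unsqueeze(1)).mean()
    return nll.item()                            # higher => less confident
```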
Gaurang Sriramanan is a fifth-year PhD student in Computer Science at the University of Maryland, College Park, where he is advised by Prof. Soheil Feizi. His research focuses on the safety, robustness, and reliability of AI systems by characterizing various failure modes and developing robust risk mitigation strategies. He holds a B.S. and M.Sc. in Mathematics from the Indian Institute of Science and an M.S. in Computer Science from the University of Maryland.
Examining Committee:
Chair: Dr. Soheil Feizi
Dean's Representative: Dr. Behtash Babadi
Members: Dr. David Jacobs, Dr. Yizheng Chen, Dr. Hal Daumé

