PhD Proposal: Mechanistic and Black-Box Interpretability for Attribution in Neural Networks
Sriram Balasubramanian
IRB 5105 https://umd.zoom.us/j/9500203024
Friday, November 7, 2025, 3:00-4:30 pm
Abstract

With the introduction of larger and more powerful neural networks each year, the inner workings of these systems become increasingly opaque. The attribution problem for neural network models is concerned with identifying the specific parts of the input or model that are responsible for a specified model output or behavior. Many problems in interpretability, such as constructing saliency maps, discovering semantically important directions in representation space, and finding sub-networks responsible for a given model behavior, can be understood as specific instances of the attribution problem in its broadest sense. My research attacks the problem from both white-box (mechanistic) and black-box perspectives. From a mechanistic perspective, I have proposed a new masking method for CNNs that can enhance the fidelity of input attributions in downstream applications, as well as decomposition-based techniques for both model and input attributions in ViTs and ViT-CNN hybrids. From a black-box perspective, using gradient-based methods, I have shown the existence of large connected regions in input space that span distinct image classes yet do not affect the model output. I have also investigated the usefulness of chain-of-thought reasoning for input attribution in multimodal LLMs, showing significant reliability gaps even compared to text-only LLMs. I aim to continue my research on interpretability methods, with greater focus on novel paradigms such as agentic systems and reasoning models.

Bio

Sriram Balasubramanian is a PhD student in the Computer Science department at the University of Maryland, advised by Prof. Feizi. His research centers on uncovering the hidden mechanisms of neural networks, with a focus on applying these insights to enhance trust in AI, modify model behavior, and discover failure modes. He has also worked on other diverse topics such as the detection of AI-generated text and understanding style mimicry in generative image models. He holds a B.Tech (Hons) in computer science and engineering from the Indian Institute of Technology Bombay and an MS in computer science from the University of Maryland.

This talk is organized by Migo Gui