Audio - spanning speech, music, and environmental sound - is central to perception yet remains underused in AI, limiting truly multimodal systems. Large Audio Language Models (LALMs) are changing this by unifying perception, reasoning, and generation in a single architecture, replacing the need for distinct models for foundational tasks such as ASR and captioning, and enabling applications ranging from question answering to controllable audio generation. Despite this progress, several bottlenecks still stand in the way of robust, generalizable audio-language models.
In this talk, I present my research to date, which addresses several of these bottlenecks through novel models, datasets, and task formulations. I also outline future directions aimed at developing more robust, generalizable, and unified audio-language models capable of both reasoning and generation.
Sonal Kumar is a third-year Ph.D. student in Computer Science at the University of Maryland (UMD). His research focuses on improving audio understanding and generation. His work has been published at ACL, ICLR, ICML, Interspeech, ICASSP, and WASPAA, among other venues. He has worked as a research intern with Adobe and Google and has collaborated on multiple projects with NVIDIA.

