Audio - spanning speech, music, and environmental sound - is central to perception yet remains underused in AI, limiting truly multimodal systems. Large Audio Language Models (LALMs) are changing this by unifying perception, reasoning, and generation in a single architecture, replacing the need for distinct models for foundational tasks such as ASR and captioning, and enabling applications ranging from question answering to controllable audio generation. Despite this progress, several bottlenecks still stand in the way of robust, generalizable audio-language models.
In this talk, I present my research to date, which addresses several of these bottlenecks through novel models, datasets, and task formulations. I also outline future directions aimed at developing more robust, generalizable, and unified audio-language models capable of both reasoning and generation.
Sonal Kumar is a third-year Ph.D. student in Computer Science at the University of Maryland (UMD). His research focuses on improving audio understanding and generation. His work has been published at ACL, ICLR, ICML, Interspeech, ICASSP, and WASPAA, among other venues. He has worked as a research intern with Adobe and Google and has collaborated on multiple projects with NVIDIA.

