PhD Defense: Advancing Audio Processing in the Age of Large Language Models
Sreyan Ghosh
Monday, April 20, 2026, 3:30-5:30 pm
Abstract

Understanding audio, encompassing speech, non-speech sounds, and music, is fundamental for AI systems to interact effectively with the world, yet audio processing has historically lagged behind language and vision due to data scarcity, limited architectures, and the inherent complexity of auditory signals. Recent advances in Large Language Models (LLMs) have begun to bridge this gap, demonstrating promising capabilities in tasks ranging from Automatic Speech Recognition and audio captioning to open-ended question answering and complex reasoning. My dissertation advances audio processing in the age of LLMs through contributions in open model development, scalable data curation, robust audio representations, long-form understanding, expert-level evaluation, and omni-modal reasoning.

In this talk, I will present the Audio Flamingo series, a family of fully open large audio-language models built with novel architectures, training curricula, and internet-scale data curation strategies, including over 1 million hours of carefully curated audio of varying lengths paired with skill-specific question-answer pairs. These models achieve state-of-the-art results across more than 20 benchmarks, surpassing both open-weight and closed-source models. I will discuss unified audio encoders such as AF-CLAP and AF-Whisper, trained on over 8 million audio-caption pairs, that bridge speech, sound, and music representation learning, and describe how we extend audio understanding from short clips to 30-minute contexts through new datasets, temporally grounded reasoning paradigms, and scaled training infrastructure. I will present Music Flamingo, which achieves expert-level music understanding through theory-grounded chain-of-thought reasoning and reinforcement learning, and UALM, which unifies audio understanding, generation, and reasoning within a single model. I will also introduce expert-level benchmarks such as MMAU and MMAU-Pro, spanning over 10,000 and 5,000 annotated instances respectively, that reveal significant gaps between current models and human-level audio reasoning: even the best models achieve only ~75% on MMAU, where humans reach ~82%, and just ~58% on the more challenging MMAU-Pro.

Finally, I will present MMOU, a large-scale benchmark of 15,000 QA pairs over 9,000 long-form real-world videos, where we demonstrate that audio intelligence is foundational, not peripheral, to video understanding, with even the best proprietary systems falling over 20 points short of human performance. Motivated by this gap, I will present Audio-Visual Flamingo, a fully open audio-visual language model we design to enable temporally grounded reasoning over long and complex real-world videos by jointly integrating audio and visual streams.

Bio

Sreyan Ghosh is a PhD student in Computer Science at the University of Maryland, College Park, advised by Professor Dinesh Manocha and Professor Ramani Duraiswami. His research focuses on advancing audio and speech understanding, particularly with Large Language Models. His work spans neural architectures, synthetic data generation, improved audio representations, and long-form audio understanding and reasoning. He has interned at Adobe, Microsoft, and NVIDIA and is a recipient of the NVIDIA Graduate Fellowship and the Outstanding Graduate Assistant Award.


Examining Committee Chair: Dr. Dinesh Manocha

Dean's Representative: Dr. Maria K. Cameron

Committee Co-Chair: Dr. Ramani Duraiswami

Members:

Dr. Nirupam Roy

Dr. Shinji Watanabe

This talk is organized by Migo Gui