This talk traces the arc of our research in advancing audio intelligence, culminating in the development of Audio Flamingo 2, 3, Next, and Music Flamingo. It will begin with our early contributions in representation learning and synthetic data generation, which laid the foundation for robust audio-language models. Building on these pillars, we helped shape successive versions of Audio Flamingo, a family of fully open, large-scale audio-language models capable of understanding speech, sounds, and music at unprecedented scale. I will conclude by highlighting how designing rigorous evaluation benchmarks (MMAU and MMAU-Pro) has been pivotal for progress in audio intelligence, and how expanding toward omni-modal intelligence can accelerate the shift from recognition to expert-level reasoning across audio and beyond.
Sreyan Ghosh is a Ph.D. student in Computer Science at the University of Maryland, College Park, advised by Professors Dinesh Manocha and Ramani Duraiswami. His research centers on advancing multimodal intelligence with a strong focus on audio -- encompassing speech, sounds, and music. His work spans novel neural architectures, synthetic data generation, enhanced audio representations, long-form audio understanding and reasoning, and comprehensive evaluation. He has interned at Adobe, Microsoft, and NVIDIA, and is a recipient of the NVIDIA Graduate Fellowship and the UMD Outstanding Graduate Assistant Award. He has been at NVIDIA since August 2024, where he has co-led the development of Audio Flamingo 2, 3, Next, and Music Flamingo, along with several other works advancing audio intelligence.

