This talk traces the arc of our research in advancing audio intelligence, culminating in the development of Audio Flamingo 2, 3, Next, and Music Flamingo. It will begin with our early contributions in representation learning and synthetic data generation, which laid the foundation for robust audio-language models. Building on these pillars, we helped shape successive versions of Audio Flamingo, a family of fully open, large-scale audio-language models capable of understanding speech, sounds, and music at unprecedented scale. I will conclude by highlighting how designing rigorous evaluation benchmarks (MMAU and MMAU-Pro) has been pivotal for progress in audio intelligence, and how expanding toward omni-modal intelligence can accelerate the shift from recognition to expert-level reasoning across audio and beyond.
Sreyan Ghosh is a Ph.D. student in Computer Science at the University of Maryland, College Park, advised by Professors Dinesh Manocha and Ramani Duraiswami. His research centers on advancing multimodal intelligence with a strong focus on audio -- encompassing speech, sounds, and music. His work spans novel neural architectures, synthetic data generation, enhanced audio representations, long-form audio understanding and reasoning, and comprehensive evaluation. He has interned at Adobe, Microsoft, and NVIDIA, and is a recipient of the NVIDIA Graduate Fellowship and the UMD Outstanding Graduate Assistant Award. He has been at NVIDIA since August 2024, where he has co-led the development of Audio Flamingo 2, 3, Next, and Music Flamingo, along with several other works advancing audio intelligence.

