Towards Auditory General Intelligence
Ramani Duraiswami
IRB 0318 (Gannon) or https://umd.zoom.us/j/93754397716?pwd=GuzthRJybpRS8HOidKRoXWcFV7sC4c.1
Friday, October 24, 2025, 11:00 am-12:00 pm
Abstract

Perception of audio events, music and speech plays a fundamental role in human interaction with the world. Auditory perception is equally vital for an animal's life and survival. I will first briefly review fundamental problems in computational audition, which, along with the rest of this talk, will also be the subject of my spring 2026 course, CMSC848U: Computational Audition.

Large language models have absorbed vast amounts of knowledge, and the scientific community is rapidly working to leverage this across a variety of domains, far beyond chat or coding. However, language models currently lag in auditory scene understanding, speech and nonspeech voiced communication, and music analysis, all of which are central facets of human intelligence. Over the past three years, Large Audio Language Models (LALMs), which process audio inputs in response to text queries, have grown increasingly capable. These models are built by creating shared representations for language and audio and by fine-tuning language models with supervised learning. I will discuss active research in this area, including significant contributions from our group at UMD, particularly Sreyan Ghosh and Sonal Kumar (co-advised by Prof. Manocha), and from our summer project at the JSALT 2025 Workshop. I will present an overview of our models: COMPA (ICLR 2024, spotlight), GAMA (EMNLP 2024, oral), Audio Flamingo 2 (ICML 2025), Audio Flamingo 3 (NeurIPS 2025, spotlight), and Music Flamingo (submitted; arXiv). AF3 and MF are the leading open-source models for audio, speech and music understanding. I will describe the training process and highlight open research questions, such as extending these models to multichannel and spatial audio and incorporating reinforcement learning (RL).

Benchmarking has been crucial for LLM development, yet benchmarks for LALMs were absent. Our group created MMAU (ICLR 2025, spotlight), the first comprehensive audio benchmark, now widely used for evaluating LALMs. To create an even more comprehensive benchmark, we enlisted numerous experts at JSALT 2025 and developed MMAUPro (arXiv 2025). I will conclude with thoughts on advancing foundation models for audio and other physical signal domains.

Bio

Ramani Duraiswami is a Professor in the Department of Computer Science at the University of Maryland, College Park, and directs the Perceptual Interfaces and Reality Lab (PIRL). He has joint appointments in the Artificial Intelligence Institute (AIM), UMIACS, Electrical Engineering, Robotics, Neural and Cognitive Sciences, and Applied Math and Scientific Computing programs. Prof. Duraiswami holds a B. Tech. from IIT Bombay and a Ph.D. from The Johns Hopkins University. His research encompasses machine learning, fast and stable scientific computing algorithms, and computational perception. Two companies have spun out of his research, and his lab’s audio engine powers millions of VR headsets, PCs, and headphones worldwide.

This talk is organized by Samuel Malede Zewdu