In the rapidly advancing domain of artificial intelligence, integrating heterogeneous modalities, particularly audio and visual streams, has become essential for robust, context-aware understanding. Our work advances multimodal foundation models through computationally efficient and semantically aligned architectures capable of handling the complexity of real-world audio-visual tasks.
We introduce novel task formulations that require joint localization, interpretation, and synthesis of multimodal information under diverse conditions. Our models tightly couple fine-grained audio-visual grounding with adaptive reasoning mechanisms, improving temporal alignment, semantic consistency, and robustness to modality-specific noise. To address the lack of standardized evaluation, we design comprehensive benchmarks with tailored protocols and metrics, enabling rigorous and fair comparisons.
In this proposal, we pursue this direction by developing efficient multimodal models that combine parameter-efficient adaptation with computation-aware inference. Our approaches pair lightweight adaptation techniques with policies that selectively process modalities, enabling models to retain high accuracy while substantially reducing training and deployment costs.
Sanjoy Chowdhury is a Ph.D. student in Computer Science at the University of Maryland, College Park, advised by Professor Dinesh Manocha and Professor Ruohan Gao. His research focuses on multimodal machine learning, computer vision, and audio-visual large language models (AVLLMs), developing systems that combine robust perception with generative modeling for tasks such as grounding, reasoning, and cross-modal synthesis. He has interned at Meta Reality Labs, Google DeepMind, Apple Research, and Adobe Research, and has been a visiting researcher at KAUST and MBZUAI. His recent efforts include developing resource-efficient and trustworthy systems for multi-speaker AVLLMs, egocentric perception, multimodal conditioned audio generation, and facial expression synthesis. Sanjoy’s work aims to advance reliable, interpretable, and creative AI systems capable of understanding and generating rich, multisensory experiences.