Audio-visual perception is essential for developing intelligent systems capable of understanding, interacting with, and reasoning about the real world. Over the past few years, there has been a major paradigm shift from specialized audio-visual models toward powerful, general-purpose multimodal foundation models and audio-visual large language models (AVLLMs). These models offer unprecedented flexibility, generalization ability, and task adaptability. Despite their impressive breadth, current foundation models still exhibit important limitations in audio-visual grounding, robustness, and adaptability: they struggle with fine-grained grounding of multimodal signals, remain fragile under uncertainty and complex reasoning conditions, and impose substantial computational costs that limit adaptation and long-term use. In this dissertation, we contribute toward enhancing how foundation models look, listen, and reason across audio-visual environments.
Advancing Fine-Grained Capabilities in Audio-Visual Foundation Models. We enhance the fine-grained perceptual capacity of foundation models, enabling precise spatial grounding of sound to specific visual entities, temporally selective reasoning over complex events, and detailed multimodal alignment beyond coarse semantic understanding. To this end, we develop multimodal grounding mechanisms, instance-level alignment strategies, and conditioning pipelines that integrate visual structure, temporal dynamics, and acoustic cues within large foundation architectures. Across multiple benchmark datasets and evaluation settings, these methods yield consistent gains in localization accuracy (up to a 30.18% gain in IoU), temporal retrieval metrics, and downstream task performance (up to 37.12% gains in BLEU scores), resulting in richer and more informative audio-visual understanding.
Strengthening Robust and Trustworthy Reasoning. We improve the robustness of audio-visual foundation models by developing methods, benchmarks, and reasoning frameworks that systematically evaluate and reinforce reliability. Here, reliability is assessed through accuracy under controlled degradations, stability of reasoning under modality conflict, calibration of trust across modalities, and reduction of failure rates and hallucinations across diverse real-world conditions. We design principled evaluation suites that stress-test models under modality conflict, ambiguous or misleading cues, multi-speaker conversational complexity, and human-centered reasoning scenarios. Beyond evaluation, we develop reliability-aware reasoning processes, self-critique and refinement strategies, and uncertainty-guided inference that stabilize decision making without requiring expensive model retraining. Empirically, these advances yield higher answer accuracy under noise and perturbations (up to a 39.52% gain in Top@1 accuracy), improved consistency across complex queries, better temporal grounding and step-wise reasoning metrics, and reduced hallucination rates compared to baseline AVLLMs, enabling models not only to process multimodal input but to reason more consistently in challenging scenarios.
Enabling Efficient and Adaptive Models. We also advance the efficiency of multimodal foundation models to make them practical for large-scale and long-term deployment. We propose approaches that significantly reduce the computational overhead of adaptation, personalization, and continual learning, allowing models to evolve with user-specific egocentric experiences and changing environments. These include lightweight adaptation modules, selective updating strategies, scalable multimodal pipelines, and resource-conscious system designs that retain high performance while dramatically improving usability. In practice, our approaches reduce adaptation and inference cost in terms of floating-point operations, latency, and trainable parameter count (up to a 90% reduction in GMACs and an 82% reduction in parameters), while maintaining or improving task accuracy. Together, these capabilities make modern multimodal models more practical, enabling efficient adaptation and personalization and supporting longer-term operation with flexible, adaptive behavior over time.
Together, these contributions demonstrate that advances in fine-grained perception, robustness, and efficiency yield measurable improvements across practical audio-visual tasks. We show consistent gains in spatial grounding, source attribution, temporally selective retrieval, multimodal reasoning, multi-speaker understanding, and egocentric perception. Quantitatively, our methods improve localization and attribution metrics (e.g., IoU and source identification accuracy), retrieval performance (e.g., Recall@K and mAP), and robustness under noise and modality conflict (e.g., higher accuracy and better calibration under controlled degradations), while reducing hallucination rates and producing qualitatively more coherent, evidence-grounded reasoning. Efficiency-focused contributions further lower computational and adaptation costs while preserving or improving task performance, supporting more flexible and adaptive operation. Overall, these results represent concrete progress in strengthening the practical capabilities of modern audio-visual foundation models.
Sanjoy Chowdhury is a Ph.D. student in Computer Science at the University of Maryland, College Park, advised by Professor Dinesh Manocha and Professor Ruohan Gao. His research focuses on multimodal machine learning, computer vision, and audio-visual large language models, developing systems that combine robust perception with generative modeling for tasks such as grounding, reasoning, and cross-modal synthesis. He has interned at Meta Reality Labs, Google DeepMind, Apple Research, and Adobe Research, and has also been a visiting researcher at KAUST and MBZUAI. His recent efforts include developing resource-efficient and trustworthy systems for multi-speaker AVLLMs, egocentric perception, multimodal-conditioned audio generation, and facial expression synthesis. Sanjoy’s work aims to advance reliable, interpretable, and creative AI systems capable of understanding and generating rich, multisensory experiences.

