“If a picture is worth a thousand words, what is a video worth?” Video plays a crucial role in conveying information because it is richer and more efficient than language. However, processing video data presents significant challenges: identifying the important frames, handling domain shifts, limited reasoning and attention ability, the semantic gap between language queries and visual content, and high computational cost, among others. Rapid advances in computer vision have underscored the importance of effective and efficient video understanding for addressing these challenges in applications ranging from autonomous systems to human-computer interaction. At the core of these advances lie four critical pillars: dataset development, preprocessing, visual reasoning mechanisms, and multimodal alignment. These interconnected components drive the ability of AI systems to interpret, reason about, and align visual data with semantic information, enabling breakthroughs in visual perception tasks.
High-quality datasets serve as the foundation, providing diverse, comprehensive, and representative data for training models that can handle real-world complexity. Developing such datasets not only improves model performance but also promotes fairness, inclusivity, and robustness against adversarial scenarios. We proposed DAVE and METEOR, which introduce unstructured video datasets to advance effective video understanding. Alongside datasets, the other three pillars (preprocessing, reasoning, and multimodal alignment) are also essential for effective and efficient video understanding. For preprocessing, we proposed MITFAS and AZTR, which use sampling and auto-cropping to enable efficient understanding. Within each frame, advances in visual reasoning empower models to go beyond surface-level pattern recognition, enabling nuanced understanding, contextual inference, and adaptive focus on salient visual elements. We proposed SCP and ICAR to enhance the model's reasoning ability and thereby enable more effective video understanding. Furthermore, multimodal alignment bridges the gap between visual data and natural language, a critical step for applications such as image captioning, visual question answering, and multimodal dialogue systems. This alignment relies on harmonizing the representation spaces of vision and language to support integration and contextual understanding across modalities. We proposed ViLA, which addresses both efficient frame sampling and effective cross-modal alignment in a unified way. By combining these areas, we aim to create AI applications that are not only effective and efficient but also capable of reasoning and aligning across complex multimodal landscapes with human-like proficiency.
Xijun Wang is a PhD student in Computer Science at the University of Maryland, where he is advised by Prof. Dinesh Manocha and Prof. Ming Lin. His research interests focus on Fundamental Model Design for Computer Vision, especially for Video Understanding and Vision-Language Models.