The past decade has witnessed remarkable progress in computer vision and robotics, driven by deep learning. However, a fundamental challenge remains: how to efficiently align and establish correspondences across different domains and modalities. In this talk, we introduce novel frameworks for learning structured alignment, progressing from visual understanding to robot control.
First, we present SCorrSAN, which learns pixel-level semantic alignment between image pairs from sparse keypoint supervision through a teacher-student framework. Next, we highlight PointVIS and UVIS, our work on instance-level alignment in videos, where we achieve high-quality video instance segmentation results using only point supervision or no supervision at all. Then, transitioning to robotics, we introduce ARDuP, a method for video-based policy learning that aligns generated visual plans with language instructions for effective control.
Finally, we discuss directions for future research, including aligning agent behavior with human preferences under noisy feedback and learning dense point alignment to enhance video action recognition.
Shuaiyi Huang is a PhD student in the Department of Computer Science at the University of Maryland, advised by Prof. Abhinav Shrivastava. Her research interests lie at the intersection of computer vision and robot learning, with a focus on visual correspondence, video understanding, and vision-guided robot control. Previously, she interned at NVIDIA.