PhD Defense: Learning Structured Alignment: From Visual Understanding to Robot Control

Shuaiyi Huang

IRB-4105 Brendan Iribe Center for Computer Science and Engineering (IRB)

Friday, June 20, 2025, 10:00 am-12:00 pm

Abstract

The past decade has witnessed remarkable progress in computer vision and robotics, driven by deep learning. However, a fundamental challenge remains: how to align and establish correspondences across different domains and modalities efficiently and effectively. In this thesis, we introduce novel frameworks for learning structured alignment, progressing from visual understanding to robot control.

In the first part of the talk, I will focus on how we advance visual understanding through structured alignment in perception tasks. First, I will present our work SCorrSAN, which learns pixel-level semantic alignment between image pairs from sparse keypoint supervision using a teacher-student framework. Next, I will highlight our work PointVIS on instance-level alignment in videos, where we achieve high-quality video instance segmentation results using only point-level supervision. Finally, I will present our work Trokens, where long-range pixel-level correspondences enable better modeling of complex motion for video action recognition.

Building on these advances in visual understanding, we next explore how structured alignment can be leveraged to drive effective decision-making and control in robotics. In the second part of the talk, I will first present ARDuP, a novel method for video-based policy learning that aligns generated visual plans with language instructions for effective control. Finally, I will present TREND, which aligns agent behavior with preferences under noisy feedback through a tri-teaching framework for robust reward learning.

Bio

Shuaiyi Huang is a PhD student in the Department of Computer Science at the University of Maryland. She works with her advisor, Prof. Abhinav Shrivastava. Her research interests lie at the intersection of computer vision and robot learning, with a focus on visual correspondence, video understanding, and vision-guided robot control. Previously, she interned at NVIDIA.

This talk is organized by Migo Gui