Understanding actions in video requires reasoning about how visual content evolves over time. While appearance provides important context about objects and scenes, motion reveals how those objects interact and change, information that is central to interpreting actions. Despite this, many modern video understanding models remain appearance-centric, treating motion as an implicit signal; as a result, they become brittle in low-data settings, under viewpoint changes, and when fine-grained temporal reasoning is required.
Motivated by these limitations, this talk revisits motion as a first-class representation for video understanding. Central to this perspective is point tracking, which provides sparse yet persistent correspondences across time and offers a natural way to represent motion explicitly. Building on this idea, I will present trajectory-based representations that align appearance features along tracked points for data-efficient few-shot action recognition, and show how explicit modeling of structured motion patterns further strengthens action understanding.
I will then discuss how motion representations extend from two-dimensional image space to three-dimensional scene space using monocular 3D point tracking, enabling geometry-aware modeling under camera motion and viewpoint changes. Finally, I will describe how explicit motion representations can be integrated into video-language models, which often process videos as collections of frames with limited temporal structure, to improve temporal reasoning and motion-centric multimodal understanding.
Overall, this talk argues for a shift toward video representations in which motion is explicitly structured and central to action understanding.
Pulkit Kumar is a PhD student in Computer Science at the University of Maryland, advised by Prof. Abhinav Shrivastava. His research focuses on motion-centric video understanding, leveraging point tracking and trajectory-based representations for efficient action recognition and for improving multimodal vision–language models.