Talks

PhD Proposal: Long-term Temporal Modeling for Video Action Understanding

Xitong Yang

Virtual: https://umd.zoom.us/j/8640167417

Monday, April 13, 2020, 10:00 am-12:00 pm

Abstract

The tremendous growth in video data, both on the internet and in real life, has encouraged the development of intelligent systems that can automatically analyze video content and understand human actions. Inspired by the success of deep convolutional neural networks (CNNs) in image understanding, many efforts have been made to extend deep networks to video understanding by modeling both spatial and temporal information. Compared to still-image analysis, the temporal component of videos provides an additional and important cue for action recognition, as a number of actions can only be distinguished when motion information is taken into account. However, temporal information also makes effective modeling more challenging for the networks, especially for semantic dynamics that span long time ranges.

In this proposal, we focus on developing effective temporal modeling methods for improved video action understanding. More specifically, we propose two approaches to modeling long-term temporal information for action recognition: (1) learning hierarchical motion representations (e.g., from lower-level motion to higher-level motion) through a multi-scale self-supervised learning framework; (2) integrating temporal relational reasoning into models through a decoupled version of non-local neural networks. We also propose a progressive learning framework for spatio-temporal action detection in videos, which can naturally handle the large spatial displacement of human boxes caused by long sequences or rapid movement of actors. Finally, we will discuss future work on modeling long-term temporal structures and reasoning about spatio-temporal relationships.

Examining Committee:

Chair: Dr. Larry S. Davis
Dept rep: Dr. Furong Huang
Members: Dr. Abhinav Shrivastava

Bio

Xitong Yang is a fourth-year Ph.D. student in CS at the University of Maryland, College Park, under the supervision of Prof. Larry S. Davis. His research focuses on deep-learning-based video understanding, including video action recognition, detection, and retrieval.

This talk is organized by Tom Hurst