In this proposal, we focus on developing effective temporal modeling methods for improved video action understanding. More specifically, we propose two approaches to modeling long-term temporal information for action recognition: (1) learning hierarchical motion representations (e.g., from lower-level to higher-level motion) through a multi-scale self-supervised learning framework; and (2) integrating temporal relational reasoning into models through a decoupled version of non-local neural networks. We also propose a progressive learning framework for spatio-temporal action detection in videos, which naturally handles the large spatial displacement of human bounding boxes caused by long sequences or rapid actor movement. Finally, we discuss future work on modeling long-term temporal structures and reasoning about spatio-temporal relationships.
Dept rep: Dr. Furong Huang
Members: Dr. Abhinav Shrivastava
Xitong Yang is a fourth-year Ph.D. student in Computer Science at the University of Maryland, College Park, under the supervision of Prof. Larry S. Davis. His research focuses on deep-learning-based video understanding, including video action recognition, detection, and retrieval.