Learning Spatio-Temporal Representations for Video Understanding
Du Tran
https://umd.zoom.us/j/94543765116?pwd=clY3MVV5Z1g4T2xpdnJMdjFiMFhYdz09
Monday, February 22, 2021, 1:00-2:00 pm
Abstract

Video understanding is one of the fundamental problems in computer vision, with applications including autonomous vehicles, robot learning, and visual perception. Compared with traditional image understanding, video understanding (i) has higher model complexity and requires learning from a much larger amount of data; (ii) requires more expensive annotations; and (iii) sometimes demands multimodal modeling, e.g., audiovisual modeling instead of visual only. In this talk, I will present some of our approaches to addressing these challenges, such as efficient and scalable spatiotemporal learning, cross-modal self-supervised learning of video and audio representations, and multimodal learning. Finally, I will outline several potential future research directions in this area.

Bio

Du Tran is a staff research scientist at Facebook AI. He graduated with a Ph.D. in computer science from Dartmouth College and an M.S. in computer science from the University of Illinois at Urbana-Champaign, receiving the Dartmouth Presidential Fellowship and the Vietnam Education Fellowship. His research interests are in computer vision, machine learning, and computer graphics, with specific interests in video understanding, representation learning, and multimodal modeling. His work on C3D was instrumental in steering the field towards the widespread adoption of 3D CNNs as the model of choice for video analysis. His video understanding architectures have been deployed in production at Facebook to process hundreds of millions of videos daily for various tasks, including video classification, violence prediction, and advertisement ranking.

This talk is organized by Richa Mathur.