Despite their great predictive capability, Convolutional Neural Networks (CNNs) usually require a tremendous amount of annotated data at training time and are computationally expensive to deploy. When analyzing videos, modeling temporal dynamics is both important and challenging because of the large appearance variation and complex semantics of video content. In this proposal, we present methods that reduce the human labor required for data annotation and improve the deployment efficiency of models for the object detection task. We also propose a temporal modeling architecture that better captures temporal dependencies for online action localization.
First, we introduce a generic framework that reduces the computational cost of object detection while retaining accuracy in scenarios where objects of varied sizes appear in high-resolution images. Detection proceeds in a coarse-to-fine manner: first on a down-sampled version of the image, and then on a sequence of higher-resolution regions identified as likely to improve detection accuracy. Built upon reinforcement learning, our approach consists of a model (R-net) that uses coarse detection results to predict the potential accuracy gain from analyzing a region at a higher resolution, and another model (Q-net) that sequentially selects regions to zoom in on.
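The zoom-in loop can be illustrated with a toy sketch. Here `gains` stands in for the R-net's predicted accuracy gains per candidate region, and a simple greedy rule replaces the learned Q-net policy; the function name, the region identifiers, and the zero-gain stopping rule are illustrative assumptions, not the paper's actual method.

```python
def select_zoom_regions(gains, budget):
    """Greedy stand-in for the Q-net: repeatedly zoom into the region
    with the highest predicted accuracy gain (hypothetical R-net scores)
    until the zoom budget is spent or zooming no longer helps."""
    chosen = []
    remaining = dict(gains)  # region id -> predicted accuracy gain
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining, key=remaining.get)
        if remaining[best] <= 0:  # no region is expected to improve accuracy
            break
        chosen.append(best)
        del remaining[best]  # each region is analyzed at most once
    return chosen
```

In the actual framework the selection is sequential because each zoom updates the detection results, which in turn changes the predicted gains; the fixed `gains` dictionary above elides that feedback.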
Second, we introduce count-guided weakly supervised localization (C-WSL), an approach that uses the per-class object count as a new form of supervision to improve weakly supervised localization (WSL). During training, C-WSL applies a simple count-based region selection algorithm to pick high-quality regions, each covering a single object instance, and it improves existing WSL methods by training them with the selected regions. To demonstrate the effectiveness of C-WSL, we integrate it into two WSL architectures and conduct extensive experiments on VOC2007 and VOC2012.
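The idea of count-based region selection can be sketched as a greedy procedure: keep the highest-scoring regions that do not overlap the ones already kept, and stop once the known object count is reached. This is a minimal sketch under assumed inputs (score and IoU threshold values are illustrative), not the exact selection algorithm used in C-WSL.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def count_guided_select(boxes, scores, count, iou_thresh=0.3):
    """Greedily pick at most `count` high-scoring, mutually
    non-overlapping regions, one per object instance."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    picked = []
    for i in order:
        if len(picked) == count:  # per-class count acts as supervision
            break
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in picked):
            picked.append(i)
    return picked
```

For example, with two heavily overlapping candidates on the same object and a count of 2, the overlap test discards the duplicate so the second slot goes to a distinct instance.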
Third, we propose Temporal Recurrent Networks (TRN) to model the temporal context of the target frame by combining online action detection and anticipation. TRN integrates detection and anticipation into a unified end-to-end architecture. At testing time, it makes use of both accumulated historical evidence and anticipated future information to estimate the actions occurring in the current frame.
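The fusion of past, present, and anticipated future evidence can be sketched with a toy function. Scalar values stand in for frame features, the caller-supplied `anticipate` function stands in for the learned anticipation branch, and simple averaging replaces the recurrent fusion; all of these are illustrative assumptions.

```python
def trn_step(history, current, anticipate, horizon=2):
    """Toy sketch of the TRN idea: the estimate for the current frame
    fuses accumulated past evidence with features anticipated for the
    next `horizon` steps."""
    future = []
    feat = current
    for _ in range(horizon):
        feat = anticipate(feat)  # roll the anticipation branch forward one step
        future.append(feat)
    # fuse past, present, and anticipated future (here: a plain average)
    fused = history + [current] + future
    return sum(fused) / len(fused)
```

The point of the sketch is the information flow, not the arithmetic: the current-frame estimate depends on evidence from both temporal directions, which is what distinguishes TRN from a detector that conditions on the past alone.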
Finally, we outline directions for future work on action localization and temporal modeling.
Dept. rep: Dr. Rama Chellappa
Members: Dr. Tom Goldstein