Deep Neural Networks (DNNs), especially Convolutional Neural Networks (CNNs), have driven advances in many areas of computer vision. However, despite their impressive predictive capability, DNNs are usually considered "heavy" in terms of parameter count and computational cost, which leads to two major challenges: first, training and deploying deep networks is expensive; second, without large amounts of annotated training data, which are costly to obtain, DNNs easily overfit and generalize poorly. In this proposal, we present approaches that address these two challenges in specific computer vision problems to improve efficiency and generalization.
First, network pruning using neuron importance score propagation. To reduce the significant redundancy in DNNs, we formulate network pruning as a binary integer optimization problem that minimizes the reconstruction error of the final responses produced by the network, and we derive a closed-form solution to it for pruning neurons in earlier layers. Based on our theoretical analysis, we propose the Neuron Importance Score Propagation (NISP) algorithm, which propagates the importance scores of the final responses to every neuron in the network and then prunes neurons across the entire network jointly.
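The propagation idea can be illustrated with a small sketch for fully connected layers. It assumes one simple rule, in which a neuron's score is the sum of the next layer's scores weighted by the absolute connection weights; this rule, and the names `propagate_importance` and `keep_top`, are illustrative assumptions, not a statement of the exact closed-form solution derived in the proposal.

```python
def propagate_importance(weights, final_scores):
    """Propagate importance scores from the final responses back to the input.

    weights[k] is a list of rows with shape (n_out, n_in) for layer k.
    Assumed rule (illustrative): score_in[i] = sum_j |W[j][i]| * score_out[j].
    Returns one score vector per layer boundary, ordered from input to output.
    """
    scores = [list(final_scores)]
    for W in reversed(weights):
        out_scores = scores[0]
        n_in = len(W[0])
        in_scores = [
            sum(abs(W[j][i]) * out_scores[j] for j in range(len(W)))
            for i in range(n_in)
        ]
        scores.insert(0, in_scores)
    return scores


def keep_top(scores, k):
    """Return the sorted indices of the k highest-scoring neurons."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])


# Toy network: 3 inputs -> 2 hidden neurons -> 2 final responses.
W1 = [[0.0, 1.0, -1.0],
      [2.0, 0.0, 0.5]]
W2 = [[1.0, -2.0],
      [0.5, 0.0]]
final_scores = [1.0, 0.5]  # importance of the final responses (assumed given)

all_scores = propagate_importance([W1, W2], final_scores)
input_scores, hidden_scores, _ = all_scores
kept_hidden = keep_top(hidden_scores, 1)
kept_inputs = keep_top(input_scores, 2)
```

Because every layer is scored from the same final responses, the neurons to keep can be chosen for all layers in one pass rather than layer by layer.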
Second, visual relationship detection (VRD) with linguistic knowledge distillation. Detecting visual relationships in images is challenging because the semantic space of visual relationships is huge and training data is limited, especially for long-tail relationships that have few instances. To improve predictive capability, especially generalization to unseen relationships, we use linguistic statistics obtained from both training annotations (internal knowledge) and publicly available text, e.g., Wikipedia (external knowledge), to regularize the learning of the visual model.
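A minimal sketch of how linguistic statistics could regularize a visual model is given here. It assumes a generic distillation objective: a smoothed predicate prior P(predicate | subject, object) is estimated from triplet counts, and a KL term pulls the model's predicted distribution toward that prior. The function names, the Laplace smoothing, and the weight `alpha` are hypothetical choices for illustration, not the specific formulation of the proposal.

```python
import math
from collections import Counter


def predicate_prior(triplets, predicates, subj, obj, smoothing=1.0):
    """Estimate P(predicate | subj, obj) from (subj, pred, obj) counts.

    Laplace smoothing (an assumption here) keeps unseen predicates nonzero.
    """
    counts = Counter(p for s, p, o in triplets if s == subj and o == obj)
    total = sum(counts.values()) + smoothing * len(predicates)
    return [(counts[p] + smoothing) / total for p in predicates]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def distillation_loss(logits, target_index, prior, alpha=0.5):
    """Cross-entropy on the label plus alpha * KL(prior || model prediction)."""
    probs = softmax(logits)
    ce = -math.log(probs[target_index])
    kl = sum(q * math.log(q / p) for q, p in zip(prior, probs) if q > 0)
    return ce + alpha * kl


# Toy statistics, standing in for annotations or external text.
predicates = ["ride", "feed", "wear"]
triplets = [
    ("person", "ride", "horse"),
    ("person", "ride", "horse"),
    ("person", "feed", "horse"),
]
prior = predicate_prior(triplets, predicates, "person", "horse")

# A prediction that matches the prior pays no KL penalty; one that puts its
# mass on an implausible predicate pays both a higher CE and a KL penalty.
agreeing_logits = [math.log(q) for q in prior]
loss_agree = distillation_loss(agreeing_logits, 0, prior)
loss_disagree = distillation_loss([0.0, 0.0, 5.0], 0, prior)
```

The same prior can be computed for subject-object pairs that never appear with a given predicate in the training annotations, which is where a term of this kind would be expected to help most.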
Third, efficient relevant motion event detection for large-scale home surveillance videos. Traditional methods for detecting motion events of objects-of-interest in large-scale home surveillance videos rely on object detection and tracking, which are extremely slow and require expensive GPU devices. To dramatically speed up relevant motion event detection and improve its performance, we propose ReMotENet, a unified, end-to-end, data-driven network that uses spatial-temporal attention-based 3D ConvNets to jointly model the appearance and motion of objects-of-interest in a video.
Fourth, the role of context selection in object detection. We investigate why context has limited utility in object detection by isolating and evaluating the predictive power of different context cues under ideal conditions in which the context is provided by an oracle. Based on this study, we propose a region-based context re-scoring method with dynamic context selection to remove noise and emphasize informative context.
Finally, we discuss future research directions for improving the efficiency and generalization of visual recognition.
Dean's rep: Dr. Thomas Goldstein
Members: Dr. Rama Chellappa