log in  |  register  |  feedback?  |  help  |  web accessibility
PhD Defense: Efficient Detection of Objects and Faces with Deep Learning
Mahyar Najibi
Tuesday, November 19, 2019, 12:00-2:00 pm Calendar
  • You are subscribed to this talk through .
  • You are watching this talk through .
  • You are subscribed to this talk. (unsubscribe, watch)
  • You are watching this talk. (unwatch, subscribe)
  • You are not subscribed to this talk. (watch, subscribe)
Object detection is a fundamental problem in computer vision and is an essential building block for many applications such as autonomous driving, visual search, and object tracking. Given its large-scale and real-time applications, scalable training and fast inference are critical. Deep neural networks, although powerful in visual recognition, can be computationally expensive. Besides, they introduce shortcomings such as lack of scale-invariance and inaccurate predictions in crowded scenes that can affect detection. This dissertation studies the intrinsic problems which emerge when deep convolutional neural networks are used for object and face detection. We introduce methods to overcome these issues which are not only accurate but also efficient.

First, we focus on the problem of scale-invariance. Performing inference on a multi-scale image pyramid, although effective, increases computation noticeably. Moreover, multi-scale inference really blooms when the model is also trained using expensive multi-scale approaches.  As a result, we start by introducing an efficient multi-scale training algorithm called "SNIPER" (Scale Normalization for Image Pyramids with Efficient Re-sampling). Based on the ground-truth annotations, SNIPER sparsely samples high-resolution image regions wherever needed. In contrast to training, at inference, there is no ground-truth information to guide region sampling. Thus, we propose "AutoFocus". AutoFocus predicts regions to be zoomed-in from low resolutions at inference time, making it possible to skip a large portion of the input pyramid. While being as efficient as single-scale detectors, these methods boost performance noticeably.

Second, we study the problem of efficient face detection. Compared to generic objects, faces are rigid and crowded scenes containing hundreds of faces with extreme scales are more common. In this dissertation, we present "SSH" (Single Stage Headless Face Detector). A method that unlike two-stage localization/classification detectors, performs both tasks in a single stage, efficiently models scale variation by design, and removes most of the parameters from its underlying network, but still achieves state-of-the-art results on challenging benchmarks. Furthermore, for the two-stage detection paradigm, we introduce "FA-RPN" (Floating Anchor Region Proposal Network). FA-RPN takes the spatial structure of faces into account and allows modification of the prediction density during inference to efficiently deal with crowded scenes.

Finally, we turn our attention to the first step in two-stage localization/classification detectors. While neural networks were deployed for the classification, localization was previously solved using classic algorithms which became the bottleneck. To remedy, we propose "G-CNN" which models localization as a search in the space of all possible bounding boxes and deploys the same neural network used for classification. Furthermore, for tasks such as saliency detection, where the number of predictions is typically small, we develop an alternative approach that runs at speeds close to 120 frames/second.

Examining Committee: 
                           Chair:              Dr. Larry Davis
                          Dean's rep:      Dr. Rama Chellappa
                          Members:        Dr. David Jacobs
                                                    Dr. John Dickerson
                                               Dr. Tom Goldstein

Mahyar Najibi is a Ph.D. candidate in the Computer Science department at the University of Maryland, where he works with Prof. Larry Davis. His research interests lie in the intersection of computer vision and machine learning. In particular, his focus is on efficient object detection, face detection, and adversarial machine learning.

This talk is organized by Tom Hurst