First, we focus on the problem of scale-invariance. Performing inference on a multi-scale image pyramid, although effective, increases computation noticeably. Moreover, multi-scale inference really blooms when the model is also trained using expensive multi-scale approaches. As a result, we start by introducing an efficient multi-scale training algorithm called "SNIPER" (Scale Normalization for Image Pyramids with Efficient Re-sampling). Based on the ground-truth annotations, SNIPER sparsely samples high-resolution image regions wherever needed. In contrast to training, at inference, there is no ground-truth information to guide region sampling. Thus, we propose "AutoFocus". AutoFocus predicts regions to be zoomed-in from low resolutions at inference time, making it possible to skip a large portion of the input pyramid. While being as efficient as single-scale detectors, these methods boost performance noticeably.
Second, we study the problem of efficient face detection. Compared to generic objects, faces are rigid and crowded scenes containing hundreds of faces with extreme scales are more common. In this dissertation, we present "SSH" (Single Stage Headless Face Detector). A method that unlike two-stage localization/classification detectors, performs both tasks in a single stage, efficiently models scale variation by design, and removes most of the parameters from its underlying network, but still achieves state-of-the-art results on challenging benchmarks. Furthermore, for the two-stage detection paradigm, we introduce "FA-RPN" (Floating Anchor Region Proposal Network). FA-RPN takes the spatial structure of faces into account and allows modification of the prediction density during inference to efficiently deal with crowded scenes.
Finally, we turn our attention to the first step in two-stage localization/classification detectors. While neural networks were deployed for the classification, localization was previously solved using classic algorithms which became the bottleneck. To remedy, we propose "G-CNN" which models localization as a search in the space of all possible bounding boxes and deploys the same neural network used for classification. Furthermore, for tasks such as saliency detection, where the number of predictions is typically small, we develop an alternative approach that runs at speeds close to 120 frames/second.
Dean's rep: Dr. Rama Chellappa
Members: Dr. David Jacobs
Dr. John Dickerson
Dr. Tom Goldstein
Mahyar Najibi is a Ph.D. candidate in the Computer Science department at the University of Maryland, where he works with Prof. Larry Davis. His research interests lie in the intersection of computer vision and machine learning. In particular, his focus is on efficient object detection, face detection, and adversarial machine learning.