PhD Proposal: Self-Supervised Learning on Large-Scale Datasets
Shlok Mishra
Thursday, September 22, 2022, 10:30 am-12:30 pm
Abstract
Humans and animals understand and perceive the world using very few, if any, labels. Most of the knowledge humans acquire is learnt without explicit supervision, simply by processing large amounts of unlabelled data. This suggests that learning without labels would be a principled way for machines to understand the world. However, most of the progress made by state-of-the-art deep neural networks has been fuelled by their reliance on annotated datasets. Annotating datasets is expensive and infeasible for many domains, and manual annotations can be noisy, unreliable, and biased by the annotators' own biases.

This manuscript discusses several ways machines can be taught without labels using Self-Supervised Learning (SSL). We show that training machines without labels can generally result in less biased and more robust representations. The manuscript addresses three main issues in SSL. The first is the over-emphasis of neural networks on low-level shortcuts such as texture. Consider a sofa with a leopard-print texture: state-of-the-art neural networks will often predict this sofa to be a leopard rather than a sofa. Unlike humans, neural networks do not understand the shape of objects and often rely on low-level cues. We propose two methods to address this. First, we suppress texture in images, which helps the network focus less on texture and more on higher-level information such as shape. Second, we augment SSL methods with negative samples that contain only the texture from the images. Augmenting with these texture-based images yields better generalization, especially in out-of-domain settings.
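To make the second idea concrete, below is a minimal sketch of a contrastive (InfoNCE) objective with extra texture-only negatives. This is an illustration rather than the proposal's actual code: the function name info_nce_with_texture_negatives is hypothetical, the texture-only patches are assumed to have already been encoded into z_texture by some texture-extraction step, and the real method may differ in how the negatives are produced or weighted.

```python
# Hypothetical sketch: InfoNCE with texture-only patches added as extra negatives.
# Names and the texture-extraction step are illustrative, not the proposal's code.
import torch
import torch.nn.functional as F

def info_nce_with_texture_negatives(z_anchor, z_positive, z_texture, temperature=0.1):
    """InfoNCE loss where texture-only embeddings serve as additional negatives.

    z_anchor, z_positive: (N, D) embeddings of two views of the same images.
    z_texture:            (M, D) embeddings of texture-only patches.
    """
    z_anchor = F.normalize(z_anchor, dim=1)
    z_positive = F.normalize(z_positive, dim=1)
    z_texture = F.normalize(z_texture, dim=1)

    # Positive logit: similarity between the two views of each image.
    pos = torch.sum(z_anchor * z_positive, dim=1, keepdim=True)      # (N, 1)
    # Negatives: the other images in the batch plus the texture-only patches.
    neg_batch = z_anchor @ z_positive.t()                            # (N, N)
    neg_batch.fill_diagonal_(float('-inf'))                          # mask out the positive pair
    neg_texture = z_anchor @ z_texture.t()                           # (N, M)

    logits = torch.cat([pos, neg_batch, neg_texture], dim=1) / temperature
    labels = torch.zeros(z_anchor.size(0), dtype=torch.long, device=z_anchor.device)
    return F.cross_entropy(logits, labels)
```

Pushing the anchor away from its own texture-only patch is the mechanism that discourages the encoder from relying on texture as a shortcut.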

The second problem we address is the poor performance of SSL methods on multi-object datasets like OpenImages. SSL has made great advances when trained on object-centric datasets like ImageNet, but the same success does not transfer to multi-object datasets like OpenImages. One fundamental reason is the cropping data augmentation that selects sub-regions of an image as positive samples. In object-centric datasets these positives are generally meaningful, since the two views usually overlap semantically. This does not hold for multi-object datasets, where an image may contain several objects and the two views may share no semantic content. To remedy this, we propose replacing one or both of the random crops with crops obtained from an object proposal algorithm. This encourages the network to learn more object-aware representations, yielding significant improvements over random-crop baselines.
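As a rough illustration of this cropping change, the sketch below swaps one of the two random crops for a crop taken from an object-proposal box. The helper get_object_proposals is a hypothetical placeholder standing in for an off-the-shelf proposal algorithm (e.g. selective search); the actual pipeline in the proposal may differ in how proposals are generated and filtered.

```python
# Hypothetical sketch: replace one random crop with an object-proposal crop.
import random
from PIL import Image

def get_object_proposals(image):
    """Placeholder: return candidate object boxes as (left, upper, right, lower).

    In practice these would come from a proposal algorithm such as selective search.
    """
    w, h = image.size
    return [(0, 0, w // 2, h // 2), (w // 4, h // 4, 3 * w // 4, 3 * h // 4)]

def object_aware_views(image):
    """Return two views: one standard random crop and one object-proposal crop."""
    w, h = image.size
    # First view: the usual random crop used in object-centric SSL pipelines.
    cw, ch = w // 2, h // 2
    x, y = random.randint(0, w - cw), random.randint(0, h - ch)
    random_view = image.crop((x, y, x + cw, y + ch))
    # Second view: a crop around a proposed object region, so both views are
    # more likely to contain (parts of) the same object.
    box = random.choice(get_object_proposals(image))
    proposal_view = image.crop(box)
    return random_view, proposal_view
```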

Thirdly, current SSL networks generally treat objects and scenes within the same framework. Since visually similar objects lie close together in representation space, we argue that scenes and objects should instead follow a hierarchical structure based on their compositionality. To this end, we propose a contrastive learning framework in which a Euclidean loss is used to learn object representations and a hyperbolic loss encourages representations of scenes to lie close to representations of their constituent objects in hyperbolic space. The hyperbolic loss induces a scene-object hypernymy by optimizing the magnitudes of the embedding norms. We demonstrate the effectiveness of the proposed hyperbolic loss by pre-training on the OpenImages and COCO datasets, showing improved downstream performance across multiple datasets and tasks, including image classification, object detection, and semantic segmentation.
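A minimal sketch of such a hyperbolic scene-object loss on the Poincaré ball is given below. It assumes scene and object embeddings already lie inside the unit ball (e.g. after an exponential map or clipping), and the particular norm ordering (scene norms pushed below object norms) and margin are illustrative choices, not necessarily the formulation used in the proposal.

```python
# Hypothetical sketch of a hyperbolic scene-object loss on the Poincaré ball.
# The exact contrastive details and norm ordering are simplified illustrations.
import torch

def poincare_distance(x, y, eps=1e-5):
    """Geodesic distance between points inside the unit Poincaré ball."""
    sq_norm_x = torch.clamp(torch.sum(x * x, dim=-1), max=1 - eps)
    sq_norm_y = torch.clamp(torch.sum(y * y, dim=-1), max=1 - eps)
    sq_dist = torch.sum((x - y) ** 2, dim=-1)
    arg = 1 + 2 * sq_dist / ((1 - sq_norm_x) * (1 - sq_norm_y))
    return torch.acosh(arg)

def scene_object_hyperbolic_loss(scene_emb, object_emb, norm_margin=0.1):
    """Pull each scene embedding toward one of its constituent objects, and
    encourage a norm ordering (scene norm < object norm here) so that scenes
    act as hypernyms of the objects they contain."""
    attract = poincare_distance(scene_emb, object_emb).mean()
    norm_order = torch.relu(
        scene_emb.norm(dim=-1) - object_emb.norm(dim=-1) + norm_margin
    ).mean()
    return attract + norm_order
```

In this sketch the hierarchy is expressed purely through distances and norms in the ball: compositional, more general concepts (scenes) sit nearer the origin than the specific objects they contain.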

Examining Committee:
Chair:
Department Representative:
Members:
Dr. David Jacobs    
Dr. Furong Huang    
Dr. Abhinav Shrivastava
Bio

Shlok Mishra is a fifth-year PhD student in the Department of Computer Science at the University of Maryland, College Park, advised by Prof. David Jacobs. His research interests lie in machine learning and computer vision; more specifically, he aims to learn high-level representations from images without using any labels, in a completely unsupervised/self-supervised fashion.

This talk is organized by Tom Hurst