PhD Defense: Learning from Less Data: Perception and Synthesis
Divya Kothandaraman
Abstract
Machine learning techniques have transformed various fields, particularly in computer vision. However, they typically require vast amounts of labeled data for training, which can be costly and impractical. This dependency on data highlights the importance of research into data efficiency. We present our work on advancements in data-efficient deep learning within the contexts of visual perception and visual generation tasks.
In the first part of the talk, we will present a glimpse of our work on data efficiency in visual perception. Specifically, we tackle the challenge of semantic segmentation in autonomous driving, assuming limited access to labeled data in both the target and related domains. We propose self-supervised learning solutions to enhance segmentation performance in unstructured and adverse weather conditions, ultimately extending to a more generalized approach that is on par with methods using immense amounts of labeled data, achieving up to 30% improvements over prior work. Next, we address data efficiency for autonomous aerial vehicles, specifically in video action recognition. Here, we integrate signal-processing representations into neural networks, achieving both data and computational efficiency. Additionally, we propose differentiable methods for learning these representations, resulting in 8-38% improvements over previous work.
In the second part of the talk, we will delve into data efficiency in visual generation. We will begin by discussing the efficient generation of aerial-view images, utilizing pretrained models to create aerial perspectives from input scenes in a zero-shot manner. By incorporating techniques from classical computer vision and information theory, our work enables the generation of aerial images from complex, real-world inputs without requiring any 3D or paired data during training or testing. Our approach is on par with concurrent methods that use vast amounts of 3D data for training.
Next, we will focus on zero-shot personalized image and video generation, aiming to create content based on custom concepts. We propose methods that leverage prompting to generate images and videos at the intersection of various manifolds corresponding to these concepts and pretrained models. Our work has applications in subject-driven action transfer and multi-concept video customization. These solutions are among the first in this area, showing significant improvements over baselines and related work. Our approaches are also data and compute efficient, relying solely on pretrained models without the need for additional training data. Finally, we introduce a fundamental prompting solution inspired by techniques from finance and economics, demonstrating how insights from different fields can effectively address similar mathematical challenges.
Bio
Divya Kothandaraman is a Computer Science PhD candidate at the University of Maryland, College Park, working with Prof. Dinesh Manocha. Previously, she was an undergraduate at the Indian Institute of Technology Madras, where she obtained a bachelor's degree in Electrical Engineering and a master's degree in Data Sciences.
Her broader research interests lie at the intersection of computer vision, deep learning, and multi-modal learning. Her recent works range from developing novel methods for generative AI tasks in controllable image and video generation, such as personalization, novel-view synthesis, and prompt mixing, to developing deep learning-based solutions for computer vision tasks such as domain adaptation and video action recognition.
This talk is organized by Migo Gui.