The computer vision community has embraced the success of training specialist models on datasets with a fixed set of predetermined object categories, such as ImageNet or COCO. However, learning only from visual supervision can limit the flexibility and generality of visual models: specifying any new visual concept requires additional labeled data, which makes it hard for users to interact with the system. In this talk, I will present our recent work LSeg, a novel multimodal modeling method for language-driven semantic image segmentation. LSeg uses a text encoder to compute embeddings of descriptive input labels (e.g., "grass" or "building") together with a transformer-based image encoder that computes dense per-pixel embeddings of the input image. The image encoder is trained with a contrastive objective to align pixel embeddings with the text embedding of the corresponding semantic class. The text embeddings provide a flexible label representation in which semantically similar labels map to similar regions in the embedding space (e.g., "cat" and "furry"). This allows LSeg to generalize to previously unseen categories at test time, without retraining and without requiring a single additional training sample. We show that joint embeddings enable semantic segmentation systems that can segment an image with an arbitrary label set. Beyond that, I will briefly introduce several works on data-efficient algorithms, such as data augmentation, that boost the performance of neural models. At the end of the talk, I will discuss ongoing research and potential future directions for multimodal modeling, such as common-sense reasoning and open-world recognition.
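To make the test-time mechanism concrete, here is a minimal sketch of how label assignment via a joint embedding space could work. This is not LSeg's implementation; it assumes hypothetical, precomputed pixel and text embeddings and simply assigns each pixel the label whose text embedding has the highest cosine similarity, which is what allows an arbitrary label set at inference time.

```python
import math

def normalize(v):
    """Scale a vector to unit length for cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def segment(pixel_embeddings, label_embeddings, labels):
    """Assign each pixel the label whose text embedding is most similar.

    pixel_embeddings: list of per-pixel vectors (hypothetical image-encoder output)
    label_embeddings: list of vectors, one per label (hypothetical text-encoder output)
    labels: label names, aligned with label_embeddings
    """
    unit_labels = [normalize(t) for t in label_embeddings]
    result = []
    for p in pixel_embeddings:
        p = normalize(p)
        # Cosine similarity between this pixel and every candidate label.
        sims = [sum(a * b for a, b in zip(p, t)) for t in unit_labels]
        result.append(labels[sims.index(max(sims))])
    return result
```

Because the label set only enters through the text embeddings, swapping in new labels (even ones unseen during training) requires no retraining, only re-encoding the label strings.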
Boyi Li is a research scientist at NVIDIA Research and a visiting scholar at Berkeley AI Research. She received her Ph.D. from Cornell University, advised by Prof. Serge Belongie and Prof. Kilian Q. Weinberger. Her research interests are in machine learning and computer vision. Her work primarily focuses on understanding daily scenes and human communication in visual context and on building human-friendly AI systems that support human-computer interaction in the real world. She has published first-author papers at venues including ICCV, AAAI, IEEE TIP, NeurIPS, CVPR, and ICLR. She actively serves as an area chair and reviewer at top venues, and served as co-general chair of the Women in Machine Learning Workshop 2021 and the Women in Computer Vision Workshop in 2020 and 2021.