Talks

Is this the beginning or is this the end for end-to-end vision models?

Ani Kembhavi

IRB 4105, Zoom Link: https://umd.zoom.us/j/99048512875

Monday, July 17, 2023, 2:30-3:30 pm

Abstract

Large language models like GPT-4 support a whole gamut of tasks in natural language, some out of the box and others using a few examples via in-context learning. In contrast, unification has been more challenging in computer vision, partly due to the heterogeneity of tasks in the visual domain. How do we create unified systems for vision that can be as capable and creative as their language counterparts? In this talk, I will present two different paths that we are actively exploring. The first path is to build large end-to-end models for computer vision, and along this direction I will introduce Unified-IO, the first single neural model to perform a large and diverse set of AI tasks spanning classical computer vision, image synthesis, vision-and-language, and natural language processing. The second path is Visual Programming, where, given a natural language description of a vision task, a program generator creates a program that is then executed on the task inputs by a program interpreter. This paradigm uses language models to parse instructions and generate code, leverages the specialized vision models that the community is building and continually improving, and scales easily to large sets of diverse tasks.

Bio

Ani Kembhavi is the Senior Director of Computer Vision at the Allen Institute for AI in Seattle. He is also an Affiliate Associate Professor in the Department of Computer Science & Engineering at the University of Washington. He is interested in research problems at the intersection of vision, language, and embodiment. He received his PhD from the University of Maryland under the supervision of Prof. Larry Davis and also spent several years at Microsoft Bing building large-scale machine learning systems for Image and Video Search. His work has won several awards, including the Best Paper award at CVPR 2023 and the Outstanding Paper award at NeurIPS 2022.

This talk is organized by Samuel Malede Zewdu.