PhD Proposal: Plug and Predict: Generative Recognition and Surrogate Training for VLMs
Kaiyu Yue
IRB-5105 or https://umd.zoom.us/j/7173057078
Tuesday, July 29, 2025, 10:00-11:30 am
Abstract

Traditional object recognition models, such as ResNet and CLIP, rely on a predefined label gallery, which limits their ability to handle open-world scenarios. Our first work proposes a generative framework that predicts object labels as next tokens conditioned on image embeddings. With a one-shot sampling strategy, the method decodes labels in parallel, supporting large-scale predictions such as the top-100 labels per image. Our second work tackles the high cost of training giant vision-language models (VLMs), in which large language models (LLMs) serve as the decoder. We first analyze the prediction trajectories of LLMs to develop a general method for constructing a smaller surrogate language model for any target LLM. Vision encoders trained on these surrogates can be grafted zero-shot into the full-size LLM for downstream tasks without additional tuning. When the decoder is then fine-tuned on these encoders, our approach reduces overall training cost by up to 45% with Llama-70B as the decoder, while improving performance over baseline methods.

Bio

Kaiyu Yue is a fourth-year Computer Science Ph.D. student at the University of Maryland, advised by Prof. Tom Goldstein. He is interested in computer vision and machine learning, with research focusing on image-to-text and text-to-image models.

This talk is organized by Migo Gui.