PhD Proposal: Plug and Predict: Generative Recognition and Surrogate Training for VLMs
Kaiyu Yue
IRB-5105 or https://umd.zoom.us/j/7173057078
Tuesday, July 29, 2025, 10:00-11:30 am
Abstract

Traditional object recognition models, such as ResNet and CLIP, rely on a predefined label gallery, which limits their ability to handle open-world scenarios. Our first work proposes a generative framework that predicts object labels as next tokens conditioned on image embeddings. With a one-shot sampling strategy, the method decodes labels in parallel, supporting large-scale predictions such as the top-100 labels per image. Our second work tackles the high cost of training giant vision-language models (VLMs), in which large language models (LLMs) serve as the decoder. We first analyze the prediction trajectories of LLMs to develop a general method for constructing a smaller surrogate language model for any target LLM. Vision encoders trained on these surrogates can be grafted zero-shot into the full-size LLM for downstream tasks without additional tuning. When the decoder is then fine-tuned on these encoders, our approach reduces overall training cost by up to 45% with Llama-70B as the decoder, while improving performance over baseline methods.

Bio

Kaiyu Yue is a fourth-year Computer Science Ph.D. student at the University of Maryland, advised by Prof. Tom Goldstein. He is interested in computer vision and machine learning, with research focusing on image-to-text and text-to-image models.

This talk is organized by Migo Gui.