PhD Defense: Towards Unifying Multimodal Models: Perception, Reasoning, Generation
Jiuhai Chen
IRB-4109 https://umd.zoom.us/j/8501825249?omn=91672049201&jst=2
Wednesday, January 28, 2026, 12:00-2:00 pm
Abstract

First, we introduce BLIP3-o, a unified foundation model for both image understanding and image generation. Unified multimodal models that support both image understanding and generation have recently gained increasing attention, yet the optimal design choices and training strategies for such models remain an open question. In this work, we present a comprehensive study of image generation based on autoregressive and diffusion models, exploring different image representations (e.g., VAE and CLIP encoders) and modeling objectives such as Mean Squared Error (MSE) and Flow Matching. We introduce a novel approach that uses a diffusion transformer to diffuse CLIP image features, achieving high training efficiency and strong performance. We also investigate joint and sequential training strategies for image understanding and image generation, and find that sequential training offers practical benefits by preserving image understanding while enabling effective image generation. Based on these findings, we develop BLIP3-o, a state-of-the-art unified model that demonstrates superior performance on a wide range of benchmarks for both image understanding and generation. We also showcase applications such as image editing, reconstruction, and interleaved generation that highlight the necessity of integrating image understanding and generation. All model weights, code, and evaluation pipelines are open-sourced to support future research.
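
To make the flow-matching-on-CLIP-features idea concrete, the sketch below shows one simplified training step in PyTorch: CLIP image features serve as the regression target, a generic transformer stands in for the diffusion transformer, and the model learns to predict the velocity of a linear noise-to-feature interpolation path. This is a minimal illustration under assumed shapes and placeholder modules (clip_feats, text_cond, dit), not the BLIP3-o implementation; a real model would also condition on the timestep t.

# Minimal flow-matching sketch on CLIP image features (placeholder modules, not BLIP3-o code).
import torch
import torch.nn as nn

B, N, D = 4, 64, 1024                      # batch size, image-token count, CLIP feature dim (assumed)

clip_feats = torch.randn(B, N, D)          # target: CLIP features of the training image
text_cond  = torch.randn(B, 77, D)         # conditioning, e.g. prompt embeddings (placeholder)
dit = nn.Transformer(d_model=D, batch_first=True)  # stand-in for the diffusion transformer

t     = torch.rand(B, 1, 1)                # random time in [0, 1]
noise = torch.randn_like(clip_feats)
x_t   = (1 - t) * noise + t * clip_feats   # linear interpolation between noise and CLIP features
target_velocity = clip_feats - noise       # flow-matching target d/dt x_t along this path

pred_velocity = dit(src=text_cond, tgt=x_t)            # predict velocity given conditioning and x_t
loss = nn.functional.mse_loss(pred_velocity, target_velocity)
loss.backward()                                        # one training step's gradient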

Then, we introduce BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong capabilities in both. In developing this state-of-the-art native image generation model, we identify four key insights: (1) most architectural choices yield comparable performance, and an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) the successful application of reinforcement learning can further push the frontier of native image generation; (3) image editing remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and a data engine; (4) data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT adopts an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, and its hidden states are then used as conditioning signals for a diffusion model that generates high-fidelity images. This architecture combines the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations on various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
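
As a rough illustration of the Autoregressive + Diffusion pipeline described above, the sketch below first samples discrete image tokens autoregressively and then reuses the autoregressive hidden states as conditioning for a diffusion-model stand-in. All modules, vocabulary sizes, and token counts are hypothetical placeholders chosen for the example, not the released BLIP3o-NEXT architecture.

# Autoregressive + Diffusion sketch (placeholders only, not BLIP3o-NEXT code).
import torch
import torch.nn as nn

VOCAB, D, NUM_IMG_TOKENS = 8192, 1024, 64   # assumed vocabulary size, hidden dim, image-token count

class ARImageTokenModel(nn.Module):
    """Stand-in for the autoregressive model over discrete image tokens."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens):
        h = self.backbone(self.embed(tokens))   # hidden states over the token sequence
        return self.head(h), h                  # next-token logits and conditioning states

ar = ARImageTokenModel()
prompt_tokens = torch.randint(0, VOCAB, (1, 16))   # tokenized multimodal prompt (placeholder)

# Stage 1: greedily sample discrete image tokens (causal masking and caching omitted for brevity).
tokens = prompt_tokens
for _ in range(NUM_IMG_TOKENS):
    logits, hidden = ar(tokens)
    next_tok = logits[:, -1].argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_tok], dim=1)

# Stage 2: the AR hidden states condition a diffusion stand-in that denoises
# latents into the final high-fidelity image.
diffusion = nn.Transformer(d_model=D, batch_first=True)
noisy_latents = torch.randn(1, 256, D)
denoised = diffusion(src=hidden, tgt=noisy_latents)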

Lastly, we briefly introduce Mini Banana, a state-of-the-art open-source native image generation model, covering its architecture, training, and applications.

Bio

Jiuhai Chen is a final-year Computer Science Ph.D. student at the University of Maryland, advised by Prof. Tianyi Zhou and Prof. Tom Goldstein. He is interested in image generation, video generation, and multimodal reasoning.

This talk is organized by Migo Gui