PhD Defense: Generating Visual Content: From Pixel Orders to Videos and Beyond
Hanyu Wang
IRB-4105 https://umd.zoom.us/j/91998449224?pwd=eikLLACYfbt3Za5sNlfRlhnJUbU1Ee.1
Wednesday, December 17, 2025, 1:00-2:30 pm
Abstract

Visual content generation is a fundamental challenge in computer vision that enables diverse applications across domains. The high-dimensional nature of visual data makes it particularly challenging to achieve both quality and precise control in generation tasks. This thesis investigates visual generation across varying levels of abstraction, ranging from fundamental pixel-level ordering to video synthesis, and extending beyond to the unification of perception and creation within large-scale multimodal systems.
We begin by addressing the foundational challenge of sequentially representing visual data through Neural Space-filling Curves, a data-driven approach that learns context-aware pixel orderings optimized for downstream tasks such as LZW compression. We then explore controlled image generation through two complementary approaches: Chop & Learn, a framework for compositional generation that enables synthesis of novel object-state combinations, and a multimodal style transfer method that effectively combines guidance from both images and text. For video generation, we introduce LARP, a novel tokenization approach with a learned autoregressive prior that achieves state-of-the-art performance while maintaining computational efficiency. Finally, we present Bridge, a unified framework that equips pre-trained MLLMs with visual generative capabilities. By utilizing a Mixture-of-Transformers architecture to handle conflicting modalities and a novel semantic-to-pixel discrete representation, Bridge enables high-precision visual understanding and high-fidelity generation within a single model, effectively closing the loop between perception and creation.

Bio

Hanyu Wang is a PhD candidate in Computer Science at the University of Maryland, College Park, where he is advised by Prof. Abhinav Shrivastava. His research focuses on computer vision and generative AI, with an emphasis on visual content generation under various conditions. His long-term goal is to build multimodal foundation models that unify the understanding and generation of different data types, enabling cross-modal learning and mutual enhancement across modalities.

This talk is organized by Migo Gui