Visual content generation is a fundamental problem in computer vision, with applications across diverse domains. The high dimensionality of visual data makes it particularly challenging to achieve both high quality and precise control in generation tasks. This thesis investigates visual generation across modalities, ranging from pixel-level ordering to the synthesis of complex spatiotemporal data.
We begin by addressing the foundational challenge of sequentially representing visual data with Neural Space-filling Curves, a data-driven approach that learns context-aware pixel orderings optimized for downstream tasks such as LZW compression. We then explore controlled image generation through two complementary approaches: Chop & Learn, a compositional generation framework that enables synthesis of novel object-state combinations, and a multimodal style transfer method that effectively combines guidance from both images and text. For video generation, we introduce LARP, a video tokenizer with a learned autoregressive generative prior that achieves state-of-the-art generation performance while maintaining computational efficiency. Finally, we outline two directions for future research: advancing latent visual diffusion models and adapting large language models (LLMs) for high-fidelity visual generation.
Hanyu Wang is a PhD student in Computer Science at the University of Maryland, College Park, where he is advised by Prof. Abhinav Shrivastava. He holds a B.Eng. in Computer Science and Technology from Xi’an Jiaotong University and an M.S. in Computer Science from the University of Maryland. His research focuses on visual content generation.