In recent years, generative visual synthesis has achieved remarkable success, with diffusion models setting new benchmarks in photorealism and creative expression. However, most state-of-the-art systems rely primarily on natural language prompts, which are often insufficient for capturing precise user intent. Text prompts struggle to specify exact spatial arrangements, object quantities, and specific object appearance, creating a "control gap" between human creativity and algorithmic output. To address this, research has shifted toward controllable synthesis, which empowers users to guide the generative process through more intuitive and structured signals. This thesis explores the paradigm of controllability across two key dimensions: spatial structure and object identity.
The first work, Grounded Text-to-Image Synthesis with Attention Refocusing, focuses on structural control through layout-based grounding. We address the limitation where models fail to follow the spatial logic of a prompt, introducing a framework that ensures the generated image strictly adheres to a specified spatial layout. This allows users to precisely define the location and scale of multiple objects within a scene. The second work, Universe, extends controllability to identity control. We address the challenge of maintaining a specific object's visual characteristics across different contexts. By developing a method to extract and preserve a "concept of interest" from a reference image, we provide users with a framework for rendering unique subjects in newly synthesized results with high fidelity.
By bridging the gap between abstract generative models and specific user needs, this research enables precise control over both the structural layout and the visual appearance of subjects. These advances support the creation of personalized digital media, the development of structured visual storytelling, and more reliable human-AI collaboration in real-world creative applications.
Quynh Thi Phung is a 4th-year PhD student advised by Prof. Jia-Bin Huang. She has interned at Adobe Research. Her research primarily focuses on controllable image and video generation.

