Generative foundation models for images and videos are pre-trained on internet-scale data, enabling them to learn broad visual priors for general-purpose generation. However, they are typically conditioned only on text prompts or a single reference image, which limits their applicability to real-world tasks that require richer visual guidance to produce specific, goal-directed outputs. In addition, training such models from scratch is prohibitively expensive for most domain-specific applications.
This proposal investigates how pre-trained generative foundation models can be adapted to real-world image and video tasks through data-efficient adaptation and enhanced visual conditioning. It argues that, with only limited domain-specific data, these general-purpose models can be transformed into effective tools for simulation and imagination, with applications in image and video generation, editing, and robotic simulation.
Jingxi Chen is a fourth-year Computer Science PhD student. His research focuses on image, video, and multimodal generation, with an emphasis on adapting powerful generative foundation models to real-world applications such as content creation, editing, reconstruction, and robotic simulation. He has been a research intern at Dolby and Amazon and is an incoming research intern at Google Research. His work has been published at top CV/ML conferences, including CVPR, ICCV, and NeurIPS.
Examining Committee Chair: Dr. John Aloimonos
Department Representative: Dr. Ramani Duraiswami
Members: Dr. Christopher Metzler

