PhD Defense: Visual Content Synthesis at Scale

Songwei Ge

IRB-5105 or https://umd.zoom.us/j/9316628340

Monday, April 7, 2025, 11:30 am-1:30 pm

Abstract

Humans love to create visual content. Every day, we take photos with smartphones, edit videos using intuitive apps, and create artworks through increasingly accessible digital tools. These widespread practices have led to an explosion of visual data shared continuously on the internet, building massive collections of images and videos that capture diverse human experiences. This enormous accumulation of visual data, together with rapid advancements in GPU computing, has become the foundation for training large-scale generative models, the key to automatically synthesizing top-tier visual content. By learning directly from these rich online visual repositories, the models internalize intricate patterns, styles, and concepts, enabling them to recompose these elements into novel samples based on the user's inputs. In this thesis, we study and design scalable generative models that digest and improve with visual data, analyze evaluation metrics that can precisely monitor progress, and develop applications based on these pre-trained models. The thesis begins by designing frameworks for scalable video generation models. These include both autoregressive models trained on discrete tokens obtained through a discrete tokenizer and diffusion models trained directly on pixels. In addition, we develop a novel video tokenization scheme, enabling more compact video representations for larger generative models to train on. Next, we perform a careful analysis of the mainstream automatic evaluation metric. In the last chapter of the thesis, we study several practical scenarios for applying pre-trained large-scale generative models, with tasks that extend beyond generation and beyond the original image and video domains.

Bio

Songwei is a fifth-year PhD student in Computer Science at the University of Maryland, advised by Prof. Jia-Bin Huang and Prof. David Jacobs. He has interned at NVIDIA and Meta and was a recipient of the NVIDIA Research Fellowship. He received his Master's degree from CMU and Bachelor's degrees from Renmin University of China. His research primarily focuses on generative models applied to images and videos.

This talk is organized by Migo Gui