PhD Proposal: Generative Visual Understanding: From Emergence to Application
Soumik Mukhopadhyay
IRB-4105 or https://umd.zoom.us/j/4393962416?pwd=QAcw7m1elrY9r0H5RTsrsS8PJxq71y.1&omn=96441740152
Monday, January 5, 2026, 2:00-3:30 pm
Abstract

Modern visual generative models have fueled creativity even among non-experts, shattering the barriers to entry posed by artistic training and creation time. These models demonstrate contextual understanding by maintaining a surprising level of consistency with the input condition, context, and constraints. This naturally leads to the following questions. Do these generative models actually understand the underlying structure, texture, and semantics of the images they generate? If so, can we leverage this understanding to further improve generative models? This thesis seeks answers to these questions for denoising diffusion probabilistic models (hereafter referred to as diffusion models).

First, we establish the existence of contextual understanding in diffusion models using a concrete exemplar task: audio-conditioned lip synchronization. Our generalizable in-the-wild results show that contextual understanding can be explicitly instilled in diffusion models through constraints such as conditional inputs and multiple losses.
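
To make the idea concrete, here is a minimal, hypothetical sketch of how such constraints might enter a diffusion training objective; the modules (denoiser, audio_encoder, sync_expert), the toy noise schedule, and the loss weighting are illustrative assumptions, not the actual method presented in the talk.

```python
# Hypothetical sketch: audio-conditioned diffusion training with an
# auxiliary lip-sync constraint. All modules and weights are
# illustrative assumptions, not the talk's actual implementation.
import torch
import torch.nn.functional as F

def training_step(denoiser, audio_encoder, sync_expert, frames, audio, t):
    cond = audio_encoder(audio)              # audio conditioning signal
    noise = torch.randn_like(frames)
    alpha = 1.0 - t                          # toy noise schedule, t in (0, 1)
    noisy = alpha**0.5 * frames + (1.0 - alpha)**0.5 * noise
    pred = denoiser(noisy, t, cond)          # predict the added noise
    denoise_loss = F.mse_loss(pred, noise)   # standard diffusion loss
    # Auxiliary constraint: estimate the clean frames and ask a pretrained
    # sync scorer how well the mouth motion matches the audio.
    x0_hat = (noisy - (1.0 - alpha)**0.5 * pred) / alpha**0.5
    sync_loss = sync_expert(x0_hat, audio)
    return denoise_loss + 0.1 * sync_loss    # weighted multi-loss objective
```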

Second, by probing unconditional diffusion models, we investigate whether diffusion training itself fosters this kind of understanding, or whether it arises only from the conditions and constraints. To this end, we examine how information is distributed across features at different noise levels and neural network blocks. Our feature accumulation techniques achieve promising performance on discriminative tasks, recasting diffusion models as unified self-supervised representation learners.
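
As a rough illustration of what such probing can look like (not the accumulation technique itself), the following sketch collects intermediate activations from a denoiser with PyTorch forward hooks; the denoiser, the chosen blocks, and the noise schedule are assumptions for illustration.

```python
# Illustrative sketch: collecting diffusion features across noise levels
# and network blocks for a downstream linear probe. The denoiser, the
# chosen blocks, and the toy noise schedule are assumptions.
import torch

def probe_features(denoiser, x0, timesteps, blocks):
    feats = []
    hooks = [b.register_forward_hook(lambda m, inp, out: feats.append(out.detach()))
             for b in blocks]  # assumes each block outputs an NCHW feature map
    for t in timesteps:        # e.g., a few noise levels in (0, 1)
        noise = torch.randn_like(x0)
        xt = (1.0 - t)**0.5 * x0 + t**0.5 * noise  # toy forward diffusion
        denoiser(xt, t)
    for h in hooks:
        h.remove()
    # Spatially pool each feature map and concatenate across noise levels
    # and blocks; the result can feed a linear classifier (the "probe").
    return torch.cat([f.mean(dim=(2, 3)) for f in feats], dim=1)
```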

Finally, we analyze the striking resemblance between the hierarchical information content of the intermediate states of diffusion and the scale spaces of Gaussian pyramids. Leveraging this insight, we propose frameworks that integrate these two well-established computer vision techniques to achieve superior performance at improved efficiency, potentially bringing pixel-space diffusion back to the forefront.
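
To illustrate the analogy (not the proposed frameworks themselves): a Gaussian pyramid discards fine detail level by level, while forward diffusion drowns out high-frequency content first as noise grows. The sketch below, with an average-pool stand-in for Gaussian blur and assumed noise levels, makes the two hierarchies easy to compare side by side.

```python
# Illustrative sketch of the analogy between Gaussian pyramids and the
# forward diffusion process. The pooling blur and the noise levels are
# simplifying assumptions for demonstration only.
import torch
import torch.nn.functional as F

def gaussian_pyramid(img, levels=4):
    """Classic scale space: repeatedly blur and downsample."""
    pyr = [img]
    for _ in range(levels - 1):
        img = F.avg_pool2d(img, 2)  # cheap stand-in for a Gaussian blur
        pyr.append(img)
    return pyr

def diffusion_states(img, ts=(0.1, 0.4, 0.7, 0.95)):
    """Variance-preserving forward diffusion at increasing noise levels."""
    return [(1.0 - t)**0.5 * img + t**0.5 * torch.randn_like(img)
            for t in ts]

# Comparing the frequency content of pyramid level k with a matched noise
# level reveals the hierarchical correspondence described above.
```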

Bio

Soumik Mukhopadhyay is a PhD student in Computer Science at the University of Maryland, College Park, advised by Prof. Abhinav Shrivastava. His research focuses on visual generative models and their use in generation, representation, and understanding.

This talk is organized by Migo Gui.