While a standard megapixel image might be worth a thousand words, it is at the same time worth more than a million pixels, and videos are worth many millions more. Understanding the local and global structures of these pixels, and their meaning, has been a core problem since the birth of computer vision as a field. Storing these massive amounts of pixels is another issue -- in the social media age, and with the democratization of access to high-quality cameras, hundreds of millions of terabytes of data are created every day. In my research, I aim to tackle both of these problems by designing good deep learning representations for tasks ranging from image classification, to text-conditioned generation, to even video compression.
In this talk, I first discuss my work on unsupervised and multimodal image understanding. I describe a benchmark where I compare learned representations both by their downstream task performance and by directly comparing the embeddings themselves. I also create a pipeline for generating synthetic text data to support better benchmarking and training of multimodal models for long video understanding.
Second, I investigate diffusion models as unified representation learners. I explore the capacity of pre-trained diffusion networks for recognition tasks and present a lightweight, learnable feedback mechanism to improve their performance. I propose to adapt that feedback mechanism for fast, higher-quality image generation.
Finally, I discuss an alternative paradigm for image understanding -- implicit neural representation. I provide an overview of this area, including my work on video compression. I also present a framework for better understanding what these models learn. I propose to build a system for real-time, high-quality video compression by adapting hypernetworks, which predict model weights from video inputs, to predict compact, high-fidelity implicit representations.
Matthew Gwilliam is a fifth-year Ph.D. student in the Department of Computer Science at the University of Maryland (UMD), advised by Professor Abhinav Shrivastava. He studies computer vision. He completed his B.S. in Computer Science at Brigham Young University in 2019, where he was fortunate to work with Ryan Farrell. During his Ph.D., he has enjoyed opportunities for professional growth as an intern at Amazon, SRI, and NVIDIA; he was also the primary organizer for the workshop "Implicit Neural Representation for Vision" at CVPR 2024.

