In recent years, computer vision has undergone a paradigm shift from static, model-centric analysis to dynamic, interactive systems. These systems let users guide and refine model behavior with visual and textual prompts, transforming users from passive observers into active collaborators. This evolution significantly enhances model effectiveness, helps resolve ambiguity, and broadens the accessibility of AI-driven applications. This thesis presents a cohesive journey through the landscape of interactive prompting, charting a course from simple pixel-level guidance to complex semantic reasoning.
The journey begins with interactive visual prompting, where minimal user input is leveraged to solve complex segmentation tasks. The first part introduces SimpSON, a novel framework that dramatically simplifies photo cleanup by segmenting multiple distracting objects with just a single click. The second part advances this theme with MaGGIe, which tackles the inherent ambiguity of instance matting in multi-person scenes. Guided by coarse mask prompts, MaGGIe produces precise alpha mattes, demonstrating how minimal but targeted guidance can yield robust and accurate outcomes. Together, these works underscore how simple visual interactions can markedly improve segmentation accuracy while minimizing user effort.
The thesis then transitions to interactive textual prompting, exploring how natural language unlocks greater expressivity in vision systems. The third part introduces CoLLM, a retrieval framework that aligns nuanced textual descriptions with precise visual content without relying on explicitly annotated triplet datasets. Building on this, the fourth part expands into the omni-modal domain with OmniRet, a unified model that seamlessly integrates image, video, audio, and text. OmniRet learns a joint embedding space to interpret complex, cross-modal user queries, establishing a foundation model for flexible and efficient information retrieval.
The final part of this thesis confronts the core challenge of unstructured prompts by introducing structured semantic representations. It establishes the scene graph as a critical intermediate representation for bridging the gap between language and vision. We demonstrate its power across three domains: enhancing relational alignment in image-text matching, guiding compositional content generation in text-to-image synthesis, and enabling verifiable, grounded action sequences in robotic task planning. These applications reveal how structured semantics are key to unlocking more robust, accurate, and interpretable AI.
Collectively, this thesis traces a clear path from simple interactions to sophisticated reasoning, providing a comprehensive exploration of prompting paradigms that make computer vision systems more responsive, precise, and intuitive. These advances not only redefine user-model collaboration but also mark significant progress toward more powerful and accessible artificial intelligence.
Chuong Huynh is a fifth-year Ph.D. candidate in Computer Science at the University of Maryland, College Park, where he is advised by Professor Abhinav Shrivastava. His research focuses on interactive computer vision, exploring how user input, whether visual or textual, can drive deeper and more adaptive image understanding. Passionate about building intelligent systems that collaborate with people, Chuong has applied his work in real-world settings through research internships at Adobe, Amazon, and Samsung Research America, where he contributed to advancing user-centric AI technologies.