PhD Defense: Vision and NLP for Creative Applications, and their Analysis
Varun Manjunatha
Monday, November 12, 2018, 12:30-2:30 pm

Recent advances in machine learning, specifically on problems in Computer Vision and Natural Language Processing, have involved training deep neural networks with enormous amounts of data. The first frontier for deep networks was uni-modal classification and detection problems (directed more towards "intelligent robotics" and surveillance applications), while the next wave involves deploying deep networks on more creative tasks and common-sense reasoning.

In the first part of the dissertation, I cover colorization of black and white images. Automatic colorization is the process of adding color to greyscale images. We condition this process on language, allowing end users to manipulate a colorized image by feeding in different captions. We present two different architectures for language-conditioned colorization, both of which produce more accurate and plausible colorizations than a language-agnostic version. Through this language-based framework, we can dramatically alter colorizations by manipulating descriptive color words in captions.
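The core idea of language-conditioned colorization can be illustrated with a minimal sketch: a caption embedding shifts the predicted chrominance, so changing a color word in the caption changes the output. The toy bag-of-words encoder, the FiLM-style additive conditioning, and the parameter names (`W_img`, `W_txt`) here are illustrative assumptions, not the dissertation's actual architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word vectors standing in for a learned language encoder (assumption).
vocab = {w: rng.normal(size=4) for w in ["red", "blue", "car", "sky"]}

def embed_caption(caption):
    """Average toy word vectors; a stand-in for a recurrent caption encoder."""
    vecs = [vocab[w] for w in caption.lower().split() if w in vocab]
    return np.mean(vecs, axis=0) if vecs else np.zeros(4)

def colorize(gray, caption, W_img, W_txt):
    """Predict per-pixel ab chrominance from grayscale intensity plus a
    FiLM-style additive shift computed from the caption embedding."""
    shift = W_txt @ embed_caption(caption)               # (2,) language bias
    ab = gray[..., None] * W_img[None, None, :] + shift  # broadcast to (H, W, 2)
    return np.tanh(ab)                                   # ab channels in [-1, 1]

gray = rng.random((4, 4))        # toy 4x4 grayscale image
W_img = rng.normal(size=2)       # hypothetical learned parameters
W_txt = rng.normal(size=(2, 4))

out_red = colorize(gray, "red car", W_img, W_txt)
out_blue = colorize(gray, "blue car", W_img, W_txt)
```

Swapping "red" for "blue" in the caption alters the predicted chrominance everywhere, which is the manipulation the abstract describes at the level of descriptive color words.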

In the second part of the dissertation, I cover the analysis of comic books using deep neural networks. In this work, we construct a dataset, COMICS, that consists of over 1.2 million panels (120 GB) paired with automatic textbox transcriptions. An in-depth analysis of COMICS demonstrates that neither text nor image alone can tell a comic book story, so a computer must understand both modalities to keep up with the plot. We introduce three cloze-style tasks that ask models to predict narrative and character-centric aspects of a panel given n preceding panels as context. Various deep neural architectures underperform human baselines on these tasks, suggesting that COMICS contains fundamental challenges for both vision and language.
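The cloze setup can be sketched as follows: encode the n context panels, then score each candidate for the next panel against that context and pick the best. The bag-of-words encoder and cosine-style scoring below are simplifying assumptions; the dissertation's models are learned neural encoders over both text and image.

```python
import numpy as np

def bow(texts):
    """Bag-of-words vectors over a shared vocabulary; a crude stand-in
    for learned panel/text encoders."""
    vocab = {}
    for t in texts:
        for w in t.lower().split():
            vocab.setdefault(w, len(vocab))
    mat = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for w in t.lower().split():
            mat[i, vocab[w]] += 1
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    return mat / np.maximum(norms, 1e-8)  # unit-normalize each row

def text_cloze(context_panels, candidates):
    """Score each candidate next-panel text against the mean context
    representation and return the index of the best-scoring candidate."""
    vecs = bow(context_panels + candidates)
    ctx = vecs[: len(context_panels)].mean(axis=0)
    scores = vecs[len(context_panels):] @ ctx
    return int(np.argmax(scores)), scores

context = ["the hero leaps across the rooftop",
           "he lands and draws his blaster"]
candidates = ["the hero fires the blaster",   # coherent continuation
              "a quiet picnic by the lake"]   # distractor
best, scores = text_cloze(context, candidates)
```

Here the coherent continuation shares vocabulary with the context and scores higher; the COMICS tasks are hard precisely because such surface overlap is not enough and models must combine text with imagery.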

Finally, I cover a method for understanding model behaviors in the Visual Question Answering (VQA) problem. Researchers have observed that VQA models tend to answer questions by learning statistical biases in the data (for example, the answer to the question "What is the color of the sky?" is usually "blue"). It is of interest to the community to explicitly discover such biases, both for understanding the behavior of such models and for debugging them. We present a simple technique for doing so, and note some unusual behaviors learned by the model in attempting VQA tasks.
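The kind of statistical bias described above can be made concrete with a frequency count: group questions by a crude "question type" (their first few words) and tally the answers within each group. This counting sketch is an illustration of the bias being discovered, not the dissertation's actual technique.

```python
from collections import Counter, defaultdict

def answer_priors(qa_pairs, prefix_len=4):
    """Group questions by their first prefix_len words (a crude question
    'type') and return each group's most frequent answer with its count,
    exposing dataset biases a VQA model can exploit."""
    groups = defaultdict(Counter)
    for question, answer in qa_pairs:
        key = " ".join(question.lower().split()[:prefix_len])
        groups[key][answer.lower()] += 1
    return {k: c.most_common(1)[0] for k, c in groups.items()}

# Hypothetical (question, answer) pairs in the style of a VQA dataset.
qa = [
    ("What is the color of the sky?", "blue"),
    ("What is the color of the sky?", "blue"),
    ("What is the color of the grass?", "green"),
    ("How many people are there?", "two"),
    ("How many dogs are there?", "one"),
    ("How many dogs are there?", "one"),
]
priors = answer_priors(qa)
```

A model that memorizes these priors can answer "what is the color ..." questions with "blue" without ever looking at the image, which is exactly the behavior the analysis aims to surface.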

Examining Committee:
  Chair:       Dr. Larry Davis
  Dean's rep:  Dr. Rama Chellappa
  Members:     Dr. Jordan Boyd-Graber
               Dr. Tom Goldstein
               Dr. David Jacobs
This talk is organized by Tom Hurst