Talks

Scalable Topic Models and Applications to Machine Translation

Wednesday, March 26, 2014, 11:00 am-12:00 pm

Abstract

Topic models are powerful tools for statistical analysis in text processing. Despite their success, their application to large datasets is hampered by the difficulty of scaling inference to large parameter spaces. In this talk, we describe two ways to speed up topic models: parallelization and streaming. We propose a scalable and flexible implementation using variational inference on MapReduce. We further demonstrate two extensions of this model: using informed priors to incorporate word correlations, and extracting topics from a multilingual corpus. An alternative approach to achieving scalability is streaming, where the algorithm sees a small part of the data at a time and updates the model gradually. Although many streaming algorithms have been proposed for topic models, they all overlook a fundamental but challenging problem: the vocabulary is constantly evolving over time. We propose an online topic model with an infinite vocabulary, which addresses this missing piece, and show that our algorithm is able to discover new words and refine topics on the fly. We also examine how topic models help in acquiring domain knowledge and improving machine translation.

Bio

Ke Zhai is a Ph.D. candidate in Computer Science working with Jordan Boyd-Graber.

This talk is organized by Jimmy Lin.