Data Selection for Statistical Machine Translation
Wednesday, December 10, 2014, 11:00 am-12:00 pm
Abstract

The quality of a statistical machine translation system depends on the example translations used to train its models. Training data can come from a variety of sources, many of which are poorly suited to any one specific task. The goal is to find the right data for training a model for a particular task. We determine the most relevant subsets of these large datasets with respect to a translation task, enabling the construction of task-specific translation systems that are more accurate and cheaper to train than the large-scale models.

We will present what has become the standard approach to identifying task-relevant training data for both language modeling and MT. We will also describe a topic-model-based extension suited to MT scenarios in which in-domain data is ultra-scarce.
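
The abstract does not name the approach it calls standard. As an illustration only, one widely used method for this kind of selection is cross-entropy difference scoring (Moore and Lewis, 2010): each sentence in the large pool is scored by its cross-entropy under an in-domain language model minus its cross-entropy under a general-domain model, and the lowest-scoring sentences are kept. The sketch below is a minimal version under simplifying assumptions: it uses add-one-smoothed unigram models rather than the n-gram models used in practice, and the function names (`train_unigram`, `cross_entropy`, `select_relevant`) and toy sentences are hypothetical, not taken from the talk.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count whitespace-separated tokens; return (counts, total token count)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-token cross-entropy (in nats) under an add-one-smoothed unigram model."""
    counts, total = model
    toks = sentence.split()
    if not toks:
        return 0.0
    # The extra +1 in the denominator reserves mass for an unseen-word type.
    log_prob = sum(
        math.log((counts[t] + 1) / (total + vocab_size + 1)) for t in toks
    )
    return -log_prob / len(toks)


def select_relevant(in_domain, general_pool, keep):
    """Return the `keep` pool sentences with the lowest cross-entropy difference."""
    in_model = train_unigram(in_domain)
    gen_model = train_unigram(general_pool)
    vocab_size = len(set(in_model[0]) | set(gen_model[0]))

    def score(s):
        # Lower score: more like the in-domain data, less like the general pool.
        return cross_entropy(s, in_model, vocab_size) - cross_entropy(
            s, gen_model, vocab_size
        )

    return sorted(general_pool, key=score)[:keep]


# Toy data (hypothetical): a tiny medical in-domain set and a mixed pool.
in_domain = ["the patient received a dose", "the doctor examined the patient"]
pool = [
    "the patient saw the doctor",
    "stocks fell on the market today",
    "the market rallied on strong earnings",
]
selected = select_relevant(in_domain, pool, keep=1)
```

On this toy data the medical sentence is the one selected, since it is the only pool sentence that the in-domain model finds more likely than the general model does. A real system would use higher-order n-gram models and would choose the cutoff by evaluating on held-out in-domain data.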

One common drawback of these domain-relevant subsets is that their in-domain vocabulary coverage is poor compared to that of the full corpus. We present new work that identifies relevant training data with the best of both worlds: task-relevant data with the lexical coverage of the full training corpus. Paradoxically, this improvement in vocabulary coverage is obtained by ignoring 70-99.99% of the words in the training corpora during the data selection process.

Joint work with Xiaodong He (MSR), Mari Ostendorf (UW), and Philip Resnik (CLIP/UMIACS).

Bio

Amittai joined CLIP in September as a postdoctoral researcher. His work focuses on quantifying textual differences, with the current goal of building small, cheap, and task-oriented statistical machine translation and natural language processing systems. He previously earned degrees in mathematics, computer science, informatics, and electrical engineering.

This talk is organized by Jimmy Lin