The quality of a statistical machine translation system depends on the example translations used to train its models. Training data can come from a variety of sources, many of which are poorly matched to any particular task. Our goal is to find the right data for training a model for a given task. We determine the most relevant subsets of these large datasets with respect to a translation task, enabling the construction of task-specific translation systems that are more accurate and cheaper to train than large-scale general models.
We will present what has become the standard approach to identifying task-relevant training data for both language modeling and MT. We also describe a topic-model-based extension suited to MT scenarios where in-domain data is extremely scarce.
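The standard approach mentioned above is commonly understood to be cross-entropy-difference scoring: rank each candidate sentence by the difference between its cross-entropy under an in-domain language model and under a general-domain language model, then keep the lowest-scoring sentences. The sketch below is illustrative only, using add-one-smoothed unigram models for brevity (real systems use higher-order smoothed LMs); all corpora and function names are invented for the example.

```python
# Illustrative sketch of cross-entropy-difference data selection.
# Unigram LMs with add-one smoothing stand in for the smoothed
# higher-order models a real system would use.
import math
from collections import Counter

def unigram_lm(corpus):
    """Build an add-one-smoothed unigram log-probability function."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    def logprob(word):
        return math.log((counts.get(word, 0) + 1) / (total + vocab))
    return logprob

def cross_entropy(logprob, sentence):
    """Per-word negative log-likelihood of a sentence under the LM."""
    words = sentence.split()
    return -sum(logprob(w) for w in words) / max(len(words), 1)

def score_pool(in_domain, general, pool):
    """Rank pool sentences by H_in(s) - H_gen(s); lower = more task-relevant."""
    lm_in = unigram_lm(in_domain)
    lm_gen = unigram_lm(general)
    scored = [(cross_entropy(lm_in, s) - cross_entropy(lm_gen, s), s)
              for s in pool]
    return sorted(scored)

# Toy corpora: a small "medical" task and a mixed general corpus.
in_domain = ["the patient received a dose", "the dose was increased"]
general = ["stocks fell sharply today", "the game ended in a draw",
           "the patient left the game"]
pool = ["the patient received treatment", "stocks ended the day higher"]
ranking = score_pool(in_domain, general, pool)
# The medical-sounding sentence receives the lower (better) score.
```

Selecting the top fraction of the ranked pool yields a small, task-relevant training set; the cutoff is typically tuned on held-out in-domain perplexity.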
One common drawback of these domain-relevant subsets is that their in-domain vocabulary coverage is poor compared to the full corpus. We present new work that identifies relevant training data with the best of both worlds: task-relevant data with the lexical coverage of the full training corpus. This improvement in vocabulary coverage is, paradoxically, obtained by ignoring 70-99.99% of the words in the training corpora during the data selection process.
Joint work with Xiaodong He (MSR), Mari Ostendorf (UW), and Philip Resnik (CLIP/UMIACS).
Amittai joined CLIP in September as a postdoctoral researcher. His work focuses on quantifying textual differences, with the current goal of building small, cheap, and task-oriented statistical machine translation & natural language processing systems. He previously acquired degrees in mathematics, computer science, informatics, and electrical engineering.