The quality of a statistical machine translation system depends on the example translations used to train its models. Training data can come from a variety of sources, many of which are poorly matched to any particular task. Our goal is to find the right data for training a model for a given task. We determine the most relevant subsets of these large datasets with respect to a translation task, enabling the construction of task-specific translation systems that are more accurate and cheaper to train than large-scale general models.
We will present what has become the standard approach to identifying task-relevant training data for both language modeling and MT. We also describe a topic-model-based extension suited to MT scenarios where in-domain data is extremely scarce.
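The standard approach mentioned above is commonly understood to be cross-entropy-difference scoring: rank each candidate sentence by the difference between its cross-entropy under an in-domain language model and under a general-domain language model, then keep the lowest-scoring sentences. The sketch below is illustrative only, using add-one-smoothed unigram models for brevity (real systems use higher-order smoothed LMs); all corpora and function names are invented for the example.

```python
# Illustrative sketch of cross-entropy-difference data selection.
# Unigram LMs with add-one smoothing stand in for the smoothed
# higher-order models a real system would use.
import math
from collections import Counter

def unigram_lm(corpus):
    """Build an add-one-smoothed unigram log-probability function."""
    counts = Counter(w for sent in corpus for w in sent.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen words
    def logprob(word):
        return math.log((counts.get(word, 0) + 1) / (total + vocab))
    return logprob

def cross_entropy(logprob, sentence):
    """Per-word negative log-likelihood of a sentence under the LM."""
    words = sentence.split()
    return -sum(logprob(w) for w in words) / max(len(words), 1)

def score_pool(in_domain, general, pool):
    """Rank pool sentences by H_in(s) - H_gen(s); lower = more task-relevant."""
    lm_in = unigram_lm(in_domain)
    lm_gen = unigram_lm(general)
    scored = [(cross_entropy(lm_in, s) - cross_entropy(lm_gen, s), s)
              for s in pool]
    return sorted(scored)

# Toy corpora: a small "medical" task and a mixed general corpus.
in_domain = ["the patient received a dose", "the dose was increased"]
general = ["stocks fell sharply today", "the game ended in a draw",
           "the patient left the game"]
pool = ["the patient received treatment", "stocks ended the day higher"]
ranking = score_pool(in_domain, general, pool)
# The medical-sounding sentence receives the lower (better) score.
```

Selecting the top fraction of the ranked pool yields a small, task-relevant training set; the cutoff is typically tuned on held-out in-domain perplexity.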
One common drawback of these domain-relevant subsets is that their in-domain vocabulary coverage is poor compared to the full corpus. We present new work that identifies relevant training data with the best of both worlds: task-relevant data with the lexical coverage of the full training corpus. This improvement in vocabulary coverage is, paradoxically, obtained by ignoring 70-99.99% of the words in the training corpora during the data selection process.
Joint work with Xiaodong He (MSR), Mari Ostendorf (UW), and Philip Resnik (CLIP/UMIACS).
Amittai joined CLIP in September as a postdoctoral researcher. His work focuses on quantifying textual differences, with the current goal of building small, cheap, and task-oriented statistical machine translation & natural language processing systems. He previously acquired degrees in mathematics, computer science, informatics, and electrical engineering.