In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning-preserving, which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision.
We support this claim through three main contributions. First, we show that a large fraction of the data in multilingual resources (such as parallel corpora and bilingual dictionaries) is identified as semantically divergent by human annotators. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks. We first show that our model for identifying semantic relations between words helps separate equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data can help train a neural machine translation system twice as fast without sacrificing quality.
Dean's rep: Dr. Philip Resnik
Members: Dr. Jordan Boyd-Graber
Dr. Ido Dagan
Dr. David Jacobs
Yogarshi Vyas is a PhD student in the Department of Computer Science at the University of Maryland, College Park. His broad research interests lie in semantics, multilingual NLP, machine translation, and their intersection. He is particularly excited about building models to compare the meaning of words and sentences across languages. His work won the Adam Kilgarriff Best Paper award at *SEM 2017.