In this thesis, we focus on such mismatches in meaning in text that we expect to be aligned across languages. We term such mismatches cross-lingual semantic divergences. The core claim of this thesis is that translation is not always meaning-preserving, which leads to cross-lingual semantic divergences that affect multilingual NLP tasks. Detecting such divergences requires ways of directly characterizing differences in meaning across languages through novel cross-lingual tasks, as well as models that account for translation ambiguity and do not rely on expensive, task-specific supervision.
We support this claim through three main contributions. First, we show that a large fraction of the data in multilingual resources (such as parallel corpora and bilingual dictionaries) is identified as semantically divergent by human annotators. Second, we introduce cross-lingual tasks that characterize differences in word meaning across languages by identifying the semantic relation between two words. We also develop methods to predict such semantic relations, as well as a model to predict whether sentences in different languages have the same meaning. Finally, we demonstrate the impact of divergences by applying the methods developed in the previous sections to two downstream tasks. We first show that our model for identifying semantic relations between words helps separate equivalent word translations from divergent translations in the context of bilingual dictionary induction, even when the two words are close in meaning. We also show that identifying and filtering semantic divergences in parallel data can help train a neural machine translation system twice as fast without sacrificing quality.
Dean's rep: Dr. Philip Resnik
Members: Dr. Jordan Boyd-Graber
Dr. Ido Dagan
Dr. David Jacobs
Yogarshi Vyas is a PhD student in the Department of Computer Science at the University of Maryland, College Park. His broad research interests lie in semantics, multilingual NLP, machine translation, and their intersection. He is particularly excited about building models to compare the meaning of words and sentences across languages. His work won the Adam Kilgarriff Best Paper award at *SEM 2017.