PhD Proposal: Detecting Fine-Grained Semantic Divergences to Improve Translation Understanding Across Languages

Eleftheria Briakou

5105 Brendan Iribe Center for Computer Science and Engineering (IRB)

Friday, May 20, 2022, 1:00-3:00 pm

Abstract

One of the core goals of natural language processing (NLP) is to develop computational representations and methods to compare and contrast text meaning across languages. Such methods are essential to many NLP tasks, such as question answering and information retrieval. One limitation of these methods is their lack of sensitivity to fine-grained semantic divergences, i.e., small meaning differences between sentences that overlap in content. Yet such differences abound even in parallel texts, i.e., texts in two different languages that are typically perceived as exact translations of each other. Detecting such fine-grained semantic divergences across languages matters for machine translation systems, as they yield challenging training samples, and for humans, who can benefit from a nuanced understanding of the source.

In this proposal, we focus on detecting fine-grained semantic divergences in parallel texts to improve machine and human translation understanding. In the first piece of completed work, we start by providing empirical evidence that such small meaning differences exist and can be reliably annotated at both the sentence and the sub-sentential level. Then, we show that they can be detected automatically and without supervision by fine-tuning large pre-trained language models to rank synthetic divergences of varying granularity. In our second piece of completed work, we turn to analyzing the impact of fine-grained divergences on Neural Machine Translation (NMT) training and show that they negatively impact several aspects of NMT outputs, e.g., translation quality and confidence. Based on these findings, we propose two orthogonal approaches to mitigate the negative impact of divergences and improve machine translation quality: first, we introduce a divergent-aware NMT framework that models divergences at training time; second, we propose generation-based approaches for revising divergences in mined parallel texts to make the corresponding references more equivalent in meaning.

Having observed how subtle meaning differences in parallel texts impact downstream applications (i.e., NMT), in our first proposed work we ask how divergence detection can be used by humans directly. We propose to extend our current divergence detection methods to explain the nature of divergences. Our approach will not only point to specific divergent segments within parallel texts, but also augment them with information external to the input (e.g., the translated segment is more specific than the original) that indicates not only whether but also how two texts differ. The success of our approach will be quantified both automatically, by comparing the explanations with gold-standard annotations, and through a user study that tests whether explanations help humans understand translations better.

Examining Committee:

Chair: Dr. Marine Carpuat
Department Representative: Dr. Leo Zhicheng Liu
Members: Dr. Philip Resnik, Dr. Hal Daumé III, Dr. Luke Zettlemoyer (Univ of WA)

Bio

Eleftheria is a fourth-year Ph.D. student in the Department of Computer Science at the University of Maryland, College Park. She is a member of the CLIP lab, advised by Marine Carpuat. Her research interests are in multilingual natural language processing (NLP) and machine learning. Her recent work focuses on detecting differences in meaning across languages to improve human and machine translation understanding.

Previously, she was a Research Intern at Facebook AI, mentored by Marjan Ghazvininejad, during Summer 2021, and at Dataminr, mentored by Joel Tetreault, during Summer 2020.

This talk is organized by Tom Hurst