While parallel texts represent invaluable resources for machine translation, they inevitably introduce biases in the cross-lingual mappings learned by machine translation models. In addition to the domain bias and translationese bias studied in past work, we argue that another form of bias arises from subtle choices in content and style made by translators to appropriately convey the meaning of the source to their target audience.
We will first study the impact of such bias on training. We will show that it can lead to mismatches in the meaning of source and target segments in parallel texts, and that these mismatches can have a substantial impact on the quality of neural machine translation. We will then turn to the problem of producing machine translation for a specific audience by controlling not only the content, but also the style of the output.
Joint work with Marianna Martindale, Xing Niu and Yogarshi Vyas.