Zoom: https://umd.zoom.us/j/
Modern machine translation (MT) systems have achieved remarkable advances, but can we rely on them to win a Polish game show, where split-second decisions, the risk of MT errors, cultural nuances, and noisy environments all come into play? I argue “Maybe not”. Outside controlled settings, MT systems can still fall short in complex real-world scenarios, especially when people rely on MT output to make decisions under time pressure, navigate noisy audio environments, or bridge cultural gaps. This proposal presents a multifaceted research agenda aimed at enhancing the robustness, trustworthiness, and user-centricity of MT systems through six key research initiatives, ultimately paving the way to winning the show.
To strengthen MT systems for real-world applications, we first focus on enhancing the robustness and trustworthiness of speech translation and on real-time evaluation of simultaneous MT. We begin with the robustness of speech perception in noisy environments, introducing a cross-lingual audio-visual (AV) speech representation that leverages lip movements to improve recognition and translation performance. This approach maximizes the benefit of limited multilingual AV pre-training data by extending an audio-only multilingual pre-trained model with a visual modality. Building on this, we turn to quality estimation for speech translation, emphasizing that users need to gauge translation reliability before acting on translation output, just as a game-show contestant must decide whether to buzz in with a guess or wait for more information. We formulate and investigate the task of quality estimation for direct speech translation, comparing cascaded and end-to-end systems to help users assess translation quality without references, and contribute a new end-to-end model that extends a text-based language model to incorporate audio. Diverging from standard MT quality assessment, we propose a question answering (QA) evaluation framework that scores simultaneous MT word by word, ensuring that key facts are conveyed quickly and accurately for real-world tasks. This complements intrinsic QA and MT metrics by jointly accounting for timeliness and translation quality.
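To make the timeliness-aware evaluation concrete, here is a minimal sketch of how a latency-sensitive QA check over a growing simultaneous translation could work. Everything here is an illustrative assumption (the substring check, the linear latency decay, and all function names), not the framework from the proposal itself.

```python
# Hedged sketch: score whether a key fact surfaces in a simultaneous
# translation stream, and how quickly. All names are hypothetical.

def earliest_answer_time(partial_outputs, answer):
    """Return the first timestamp at which the growing translation
    contains the key answer, or None if it never appears."""
    for elapsed, text in partial_outputs:
        if answer.lower() in text.lower():
            return elapsed
    return None

def timeliness_score(partial_outputs, answer, deadline):
    """Combine correctness and latency: full credit for an answer
    available immediately, decaying linearly to zero at the deadline."""
    t = earliest_answer_time(partial_outputs, answer)
    if t is None or t > deadline:
        return 0.0
    return 1.0 - t / deadline

# Example: the key fact "Warsaw" surfaces 2.5 s into a 10 s window.
stream = [(1.0, "The capital"), (2.5, "The capital of Poland is Warsaw")]
print(timeliness_score(stream, "Warsaw", deadline=10.0))  # 0.75
```

A real framework would pose questions to a QA model over each partial hypothesis rather than matching substrings, but the same joint accounting of correctness and latency applies.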
Enhancing robustness and trustworthiness on the model side alone is not enough to fully unlock the potential of MT tools, or to guarantee a win in the Polish game show. We therefore shift our focus to user-centered challenges: vocabulary adaptation, cultural knowledge gaps, and mental models. Users translating to or from lower-resourced languages may face disadvantages in generation speed, cost, and accuracy when using English-centric models, especially in time-sensitive settings. To address these computational inequities, we transfer vocabularies using adapter modules to mitigate over-fragmentation (the excessive splitting of words into subword units caused by limited vocabulary coverage), reducing disparities in processing speed and performance. Even once these challenges are addressed, users from different cultural backgrounds may struggle to fully understand translation output because of cultural gaps between the source speaker and the target audience, especially when the game-show contestant is neither Polish nor familiar with the culture. To bridge these gaps, we explore automatic explicitation techniques that make implicit cultural information explicit, and introduce evaluation methods based on downstream QA tasks. Finally, getting the most out of collaboration between users and MT tools requires a clear understanding of users’ mental models, that is, their awareness of the AI tools’ strengths and limitations. In our proposed work, we investigate users’ mental models of MT systems, focusing on how they perceive error boundaries and incorporate AI recommendations into their decisions.
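To picture over-fragmentation, consider a toy greedy longest-match segmenter with invented vocabularies; this is only an illustration of the disparity that the adapter-based vocabulary transfer targets, not the method itself.

```python
# Hedged sketch: over-fragmentation under an English-centric subword
# vocabulary. The vocabularies and segmenter are toy assumptions.

def segment(word, vocab):
    """Greedy left-to-right longest-match segmentation into subwords,
    falling back to single characters when nothing longer matches."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

# An English-centric vocabulary covers the whole English word...
en_vocab = {"trans", "lation", "translation"}
# ...but only short fragments of its Polish counterpart.
pl_vocab = {"tł", "umacz", "enie"}

print(segment("translation", en_vocab))  # ['translation']         -> 1 token
print(segment("tłumaczenie", pl_vocab))  # ['tł', 'umacz', 'enie'] -> 3 tokens
```

Since decoding cost scales with the number of generated tokens, this three-fold fragmentation translates directly into slower and more expensive generation for the lower-resourced language.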
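Explicitation can likewise be made concrete with a minimal template-based sketch that appends a gloss the first time a culturally specific entity appears; the glossary and helper below are invented for illustration, whereas actual explicitation would detect such entities and generate or retrieve glosses automatically.

```python
# Hedged sketch: template-based explicitation with a hand-made glossary.
import re

glossary = {
    "Milionerzy": "the Polish version of 'Who Wants to Be a Millionaire?'",
}

def explicitate(text, glossary):
    """Append a parenthetical gloss after the first mention of each
    known culturally specific entity."""
    for entity, gloss in glossary.items():
        text = re.sub(rf"\b{re.escape(entity)}\b",
                      f"{entity} ({gloss})", text, count=1)
    return text

print(explicitate("She won the final round of Milionerzy.", glossary))
# -> She won the final round of Milionerzy (the Polish version of
#    'Who Wants to Be a Millionaire?').
```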
Collectively, these contributions offer a holistic framework for the next generation of MT, one that moves beyond mere text matching toward robust, context-aware, and user-focused solutions. Such advancements promise meaningful impact on global communication, making MT more accessible and reliable across linguistic, cultural, and technological divides.
HyoJung Han is a Ph.D. student in Computer Science at the University of Maryland, College Park (UMD), advised by Marine Carpuat and Jordan Boyd-Graber. She is a member of the Computational Linguistics and Information Processing (CLIP) Lab. During her Ph.D., she has interned at Meta FAIR and Microsoft. She is interested in multilingual and multimodal NLP and its evaluation methods for tackling language barriers and even cultural background gaps. Specifically, she works on machine translation and speech translation.
Before her Ph.D., she was a research engineer at Samsung Research (SR), the advanced R&D hub of Samsung Electronics, where she worked mainly on simultaneous and offline speech and text translation. She completed her M.S. at KAIST. HyoJung is a recipient of the Outstanding Graduate Assistant Award at UMD.