PhD Proposal: Gathering Language Data Using Experts
Denis Peskov
Remote
Friday, February 5, 2021, 1:30-3:30 pm
Abstract
Natural Language Processing needs substantial data to make robust predictions. We compare projects that use automatic generation, crowd-sourcing, and domain experts to generate large textual corpora. Specifically, we curate conversational and question answering NLP datasets.

A low-cost, high-output approach to data creation is automation. We explore this approach by creating a large-scale audio question answering dataset through text-to-speech technology. We conclude that the cost savings and scalability of automation come at the expense of data quality and naturalness.
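
As a rough illustration of such a pipeline, here is a minimal sketch that synthesizes spoken versions of text questions. It uses the open-source gTTS library purely as a stand-in (the abstract does not name the TTS system used, and the sample questions are invented):

    # Hypothetical sketch: turning text questions into audio clips with
    # gTTS (a stand-in TTS system, not necessarily the one used in this work).
    from gtts import gTTS

    questions = [
        "Who composed the opera The Magic Flute?",   # invented examples
        "What is the capital of Burkina Faso?",
    ]

    for i, question in enumerate(questions):
        tts = gTTS(question, lang="en")       # request synthesized English speech
        tts.save(f"question_{i:04d}.mp3")     # one audio clip per question

A pipeline like this scales to millions of questions at negligible cost, which is exactly the appeal of automation; the resulting speech, however, lacks the disfluencies and prosody of natural spoken questions.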

Human input can provide a degree of naturalness but is limited in scale. Hence, large-scale data collection is frequently done through crowd-sourcing. A question-rewriting task, in which a long information-gathering conversation is used as source material for many stand-alone questions, shows the limitations of this methodology for generating data. Standard inter-annotator agreement metrics, while useful for annotation, cannot easily evaluate generated data, causing a serious quality control issue. We observe this problem while formalizing the question-rewriting task: left unsupervised, certain users provide low-quality rewrites, such as removing words from the question or copying and pasting the answer into the question. We develop an interface to prevent bad submissions and hand-review over 5,000 submissions.
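
To make the mismatch concrete, consider a minimal sketch (assuming scikit-learn; the helper function and its checks are hypothetical illustrations, not the interface from this work). Agreement metrics such as Cohen's kappa apply to categorical labels, while free-form generated rewrites can only be screened with heuristics targeting known failure patterns like those above:

    # Agreement is well-defined for categorical annotation...
    from sklearn.metrics import cohen_kappa_score

    labels_a = ["yes", "no", "yes", "yes"]   # annotator A's labels (invented)
    labels_b = ["yes", "no", "no", "yes"]    # annotator B's labels (invented)
    print(cohen_kappa_score(labels_a, labels_b))

    # ...but generated rewrites have no second "annotator" to agree with,
    # so we can only flag known bad patterns (hypothetical heuristic).
    def is_suspect_rewrite(question: str, answer: str, rewrite: str) -> bool:
        q_tokens = set(question.lower().split())
        r_tokens = set(rewrite.lower().split())
        deleted_words_only = r_tokens <= q_tokens           # words removed, nothing added
        answer_pasted = answer.lower() in rewrite.lower()   # answer copied into question
        return deleted_words_only or answer_pasted

Heuristics like these can catch the most blatant shortcuts automatically, but the long tail of low-quality rewrites still requires the human review described above.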

We mitigate the quality control issues identified in crowd-sourcing and automation by exploring hybrid solutions. In one hybrid approach, Amazon customer service agents curate and annotate 81,000 goal-oriented conversations across six domains. By grounding each conversation with a reliable conversationalist, the Amazon agent, we create untemplated conversations and reliably identify low-quality ones. The language generated by crowd workers alone is markedly lower in quality and would not yield natural dialogues.

Natural sources of data can instead be found in specialized communities of interest. We posit that domain experts can be used to create large and varied datasets that do not require extensive quality control. In a study on the game of Diplomacy, which investigates the language of trust and deception, Diplomacy community members generate a corpus of 17,000 messages that are self-annotated during play. The language varies in length, tone, vocabulary, and punctuation, and even includes emojis! Additionally, we create a real-time self-annotation system that captures deception in a manner not possible through crowd-sourced or automatic methods. We propose future work that leverages experts to create a new machine translation task: cultural adaptation. Identifying relevant communities for a specific NLP task and providing a service to them can set new standards for NLP corpora.
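
As a minimal sketch of what such a self-annotated message record might look like (the field names are hypothetical; the actual schema is not specified in this abstract):

    # Hypothetical record for one self-annotated Diplomacy message:
    # the sender labels the message's intent in real time, during play.
    from dataclasses import dataclass

    @dataclass
    class DiplomacyMessage:
        sender: str            # e.g., the power "France"
        recipient: str         # e.g., the power "England"
        text: str              # free-form negotiation message
        sender_truthful: bool  # sender's real-time self-annotation: truth or lie

    msg = DiplomacyMessage(
        sender="France",
        recipient="England",
        text="I promise I will support your move into Belgium 🤝",
        sender_truthful=False,  # annotated as deceptive by the sender
    )

Because the label comes from the deceiver at the moment of writing, it records ground-truth intent that neither crowd workers nor automatic methods could recover after the fact.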

Examining Committee:

Chair: Dr. Jordan Boyd-Graber
Dept rep: Dr. Michelle Mazurek
Members: Dr. Philip Resnik
Bio
Denis Peskov is a fifth-year Ph.D. student in the Department of Computer Science, working with Prof. Jordan Boyd-Graber. His interests lie in using experts to create datasets for Natural Language Processing.


This talk is organized by Tom Hurst