PhD Defense: Gathering Language Data Using Experts
Denis Peskov
Thursday, December 16, 2021, 3:00-5:00 pm
Natural language processing needs substantial data to make robust predictions. Automatic methods, unspecialized crowds, and domain experts can be used to collect conversational and question-answering NLP datasets. A hybrid solution that combines domain experts with the crowd generates large-scale, free-form language data.

A low-cost, high-output approach to data creation is automation. We create and analyze a large-scale audio question-answering dataset through text-to-speech technology. Additionally, we create synthetic data from templates to identify limitations in machine translation. We conclude that the cost savings and scalability of automation come at the expense of data quality and naturalness.

Human input can provide this degree of naturalness, but is limited in scale. Hence, large-scale data collection is frequently done through crowd-sourcing. A question-rewriting task, in which a long information-gathering conversation is used as source material for many stand-alone questions, shows the limitations of this methodology for generating data. If left unsupervised, certain users provide low-quality rewrites, such as removing words from the question or copying and pasting the answer into the question. We automatically prevent unsatisfactory submissions with an interface, but the quality control process still requires manually reviewing 5,000 questions.

Therefore, we posit that using domain experts for data generation can create novel and reliable NLP datasets. First, we introduce computational adaptation, which adapts, rather than translates, entities across cultures. We work with native speakers in two countries to generate the data, since the gold label for this task is subjective and paramount. Furthermore, we hire professional translators to assess our data. Last, in a study on the game of Diplomacy, community members generate a corpus of 17,000 messages that are self-annotated while playing a game about trust and deception. The language varies in length, tone, vocabulary, punctuation, and even emoji use. Additionally, we create a real-time self-annotation system that annotates deception in a manner not possible through crowd-sourced or automatic methods. The extra effort in data collection will hopefully ensure the longevity of these datasets and galvanize other novel NLP ideas.

However, experts are expensive and limited in number. Hybrid solutions pair potentially unreliable and unverified users in the crowd with experts. We work with Amazon customer service agents to generate and annotate 81,000 goal-oriented conversations across six domains. Grounding the conversation with a reliable conversationalist (the Amazon agent) creates free-form conversations; using the crowd scales these to the size needed for neural networks.

Examining Committee:
Dean's Representative:
Dr. Jordan Boyd-Graber    
Dr. Philip Resnik    
Dr. Michelle Mazurek    
Dr. Katie Shilton    
Dr. John Dickerson

Denis Peskov is a Ph.D. student in Computer Science advised by Professor Jordan Boyd-Graber. His research creates novel natural language processing datasets and has been supported by the DAAD and ARLIS. Peskov will join Princeton University as a CIFellows postdoctoral researcher in Spring 2022.

This talk is organized by Tom Hurst