Structural Scaffolds for Making Sense of Document Collections + Tokenization in the era of pretrained language models
Joe Barrow and Chenglei Si - University of Maryland
Wednesday, March 16, 2022, 11:00 am-12:00 pm
Abstract

Structural Scaffolds for Making Sense of Document Collections (by Joe Barrow)
----
As readers, we often attempt to make sense of (one or more) documents using structure that goes beyond the content itself: a scientist using sections and subsections to "pre-read" a scientific paper, or a web searcher trying to make sense of conflicting viewpoints about a topic. This structure helps a reader build mental maps of the information; without such maps, it is easy to "miss the forest for the trees." In this work, we aim to induce this structure in cases where it is not already explicit, which we refer to as "structural scaffolding." In particular, we focus on two types of scaffolds: topical scaffolds of documents, where we create labeled sections over an unstructured document to support pre-reading, and syntopical scaffolds of document collections, where we identify, group, and present viewpoints from many documents at once. We find that "content-only" approaches build worse scaffolds of both types than approaches that account for both content and context.

 

Tokenization in the era of pretrained language models (by Chenglei Si)
----
In this talk, we review the different tokenization strategies used in various pretrained language models (PLMs). We focus on the granularity of tokenization (sub-character, character, sub-word, word) and compare their pros and cons in terms of performance, efficiency, and robustness. In particular, we highlight two of our own works along this line: one on fusing character and sub-word representations in English PLMs, and the other on a novel sub-character tokenization scheme designed for Chinese PLMs.
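As a rough illustration of the granularity spectrum mentioned in the abstract, the sketch below contrasts word-, sub-word-, and character-level tokenization of the same sentence. It uses the Hugging Face transformers BERT tokenizer purely as a stand-in for a sub-word vocabulary; it is not one of the specific models or methods discussed in the talk.

# Minimal sketch of tokenization granularities (illustration only; the
# tokenizers discussed in the talk may differ).
from transformers import AutoTokenizer

text = "Tokenization strategies differ across pretrained language models."

# Word-level: split on whitespace; large vocabulary, no sharing across word forms.
word_tokens = text.split()

# Sub-word level: a WordPiece vocabulary (BERT, used here as a stand-in);
# rarer words are broken into frequent pieces such as "token" + "##ization".
subword_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
subword_tokens = subword_tokenizer.tokenize(text)

# Character-level: tiny vocabulary, but much longer sequences.
char_tokens = list(text.replace(" ", ""))

print("word     :", word_tokens)
print("sub-word :", subword_tokens)
print("character:", len(char_tokens), "tokens")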

Joe will present over Zoom and Chenglei will present in person.
Zoom: https://umd.zoom.us/j/98806584197?pwd=SXBWOHE1cU9adFFKUmN2UVlwUEJXdz09
(passcode if needed: clip)

Bio

Joe Barrow is a PhD student at UMD, working with Prof. Philip Resnik and Prof. Doug Oard. He's interested in building tools that help people learn and make informed decisions. You can learn more at: https://jbarrow.ai

Chenglei Si is an undergraduate at UMD advised by Professor Jordan Boyd-Graber. His current research mainly focuses on question answering and generalization. He has published several first-author papers at ACL and EMNLP. This summer, he will join Microsoft as a research intern to work on GPT-3. You can refer to his homepage https://noviscl.github.io for more information.

This talk is organized by Wei Ai