Talks

Search Beyond the Digital: Finding Undigitized Items in Archival Repositories

Doug Oard (UMD), Tokinori Suzuki (Kyushu University), Emi Ishita (Kyushu University)

5105 Brendan Iribe Center for Computer Science and Engineering (IRB)

Wednesday, March 6, 2024, 11:00 am-12:00 pm

Abstract

Information retrieval has for decades focused on finding digital documents, including documents that were born digital and those that have been digitized. But there are also enormous collections of physical documents, on paper or microfilm for example, that are not likely to be fully digitized in our lifetimes. For example, the U.S. National Archives and Records Administration (NARA) presently holds 11.7 billion pages, only about 2% of which are in digital or digitized form. NARA is just one among many thousands of archival repositories; there are more than 26,000 such repositories in the United States alone. Access to the culturally important materials that these repositories curate is presently mediated largely through high-level descriptions of entire collections that have been written by archivists, along with detailed descriptions of how some of those collections are organized. In this talk, we will describe a project in which we seek to build on that descriptive work, both by leveraging the limited amount of digitization that has been performed and by assembling descriptions of archival content from published materials such as journal articles or books. We’ll describe two sets of experiments. In the first, for U.S. State Department documents stored at NARA, we asked whether we could guess which box to look in to satisfy a query, based on having digitized just a few documents from each box. In the second, we asked whether we could find citations to archival materials in scholarly literature. We’ll use the results of these experiments to motivate our broader research program, in which we seek to model the content of unseen documents based on multiple sources of evidence about other documents in the same collection, and in which we seek to enrich that evidence by helping scholars who are working in archives to expand what we know about the contents of those repositories.
This is joint work with David Doermann, Katrina Fenlon, Diana Marsh and Yoichi Tomiura.

Bio

Doug Oard is a Professor at the University of Maryland, with joint appointments in the College of Information Studies and UMIACS. He is perhaps best known for his research on Cross-Language Information Retrieval (CLIR), but more generally his research has addressed the use of technologies such as machine translation, speech recognition, document image analysis, knowledge representation, processing mathematical notation, and social network analysis to support information access. More information can be found at http://terpconnect.umd.edu/~oard

Tokinori Suzuki is an Assistant Professor in the Department of Informatics at Kyushu University, Japan. He is the principal investigator of a project on the development of an advanced search system for documents in archives, which is supported by the Japan Society for the Promotion of Science. His research interests include information access technologies, information retrieval and web mining. He will be a visiting scholar in the UMIACS CLIP Lab for the 2024-2025 academic year. More information can be found at https://researchmap.jp/tokinori?lang=en

Emi Ishita is a Professor at Kyushu University (Japan), with appointments in the Department of Library Science (iSchool) and the Research Data Service Division at the University’s Data-Driven Innovation Initiative. Her research interests include text classification, computational social science, library and information science education, and research data management. More information can be found at https://hyoka.ofc.kyushu-u.ac.jp/search/details/K003977/english.html

This talk is organized by Rachel Rudinger.