Among recent trends in corpus technology, two developments have particularly high potential for advancing the way we work with corpora: the advent of multilayer annotations, made possible by tool chains such as CoreNLP (Manning et al. 2014), and the growth of the community of corpus practitioners and of related software, which allows for online, collaborative and distributed annotation across space and time. Using NLP and manual annotation in tandem is crucial for developing high-quality resources across content domains; the task becomes more complex with each annotation layer we add, just as the potential for new insights and applications multiplies.
In this talk I will present annotation, modelling and retrieval software, related methodologies, and a first evaluation of results using the example of the GUM corpus: a new, freely available multilayer resource encompassing multiple genres, collected and edited using collaborative annotation software as part of the Computational Linguistics curriculum at Georgetown University (http://corpling.uis.georgetown.edu/gum). After discussing corpus design for open, extensible corpora, I will present five classroom annotation projects, covering structural markup in TEI XML (http://www.tei-c.org/), multiple part-of-speech tagging and lemmatization, constituent and dependency parsing, information structure, entity and coreference annotation, as well as Rhetorical Structure Theory analysis (Mann & Thompson 1988, Taboada & Mann 2006). Each of these annotation layers is evaluated, and the layers are then merged for search and visualization in ANNIS (Krause & Zeldes 2015), where they can be used to study the interactions between different levels of linguistic description. The evaluation gives some first indications of the annotation quality that can be expected from student annotators with relatively little training across a wide spectrum of tasks, of how much better they perform than automatic NLP tools and on which tasks, and of the lessons we can learn about best practices. The results show that high-quality, richly annotated resources can be created quickly and effectively as part of a linguistics curriculum, opening new possibilities not just for research, but also for student involvement in research from the very beginning of their studies.
Krause, T. & Zeldes, A. (2015). ANNIS3: A new architecture for generic corpus query and visualization. Literary and Linguistic Computing. Early access online: http://dsh.oxfordjournals.org/content/digitalsh/early/2014/12/02/llc.fqu057.full.pdf.
Mann, W. C. & Thompson, S. A. (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text 8(3), 243–281.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J. & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Baltimore, MD, 55–60.
Taboada, M. & Mann, W. C. (2006). Rhetorical Structure Theory: Looking back and moving ahead. Discourse Studies 8(3), 423–459.
Amir Zeldes is assistant professor of Computational Linguistics at Georgetown University, specializing in the field of Corpus Linguistics. He has developed software for corpus annotation, search and visualization, and is particularly interested in multilayer corpora, which model concurrent analyses for morphology, syntax, semantics, coreference and more. His theoretical research focuses on the syntax-semantics interface, where meaning and knowledge about the world are mapped onto lexical choice in language-specific ways. His book Productivity in Argument Selection: From Morphology to Syntax explores the idea that constructions have idiosyncratic degrees of innovation that speakers must learn in each language. He has worked on a variety of topics and languages, including second language writing in German, and Natural Language Processing for under-resourced languages, such as Coptic and Hausa.