Papers
Publications and citation information
License and attribution information
GUM is made available under a Creative Commons license in keeping with the underlying texts. The documents from Wikimedia (Wikinews, including interviews, and Wikivoyage) are available under a CC-BY attribution license, as are academic articles, Wikipedia biographies, OpenStax textbooks and YouTube vlogs (retrieved using YouTube's Creative Commons filtered search). Some of the political speeches included in the corpus did not specify exact licenses, but are made available by official government and UN websites which indicate that these speeches are in the public domain, and not subject to copyright. Conversations from the Santa Barbara Corpus have been made available for annotation in GUM under the CC-BY license, courtesy of Jack DuBois (UCSB).
However, please note that wikiHow texts and fiction texts are made available under a CC-BY-NC-SA license (non-commercial, share alike), meaning that commercial and/or non-open source use of those texts is prohibited. Data from Reddit forum discussions is not distributed with the corpus, but can be obtained using a script under the licensing conditions imposed by Reddit. When using the data, please make sure to cite the sources of the texts as required by their source sites, and give credit to the GUM annotators, who are listed below, for the annotated data.
As a scholarly citation for the corpus in articles, please use this paper:
- Zeldes, Amir (2017) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation 51(3), 581–612.
@Article{Zeldes2017,
  author  = {Amir Zeldes},
  title   = {The {GUM} Corpus: Creating Multilayer Resources in the Classroom},
  journal = {Language Resources and Evaluation},
  year    = {2017},
  volume  = {51},
  number  = {3},
  pages   = {581--612},
  doi     = {10.1007/s10579-016-9343-x}
}
Papers using GUM
This is a non-exhaustive list of papers using the GUM corpus; feel free to let us know if you know of more:
- Zeldes, Amir and Simonson, Dan (2016) Different Flavors of GUM: Evaluating Genre and Sentence Type Effects on Multilayer Corpus Annotation Quality. In: Proceedings of LAW X - The 10th Linguistic Annotation Workshop at the Annual Meeting of the ACL. Berlin, 68-78.
- Zeldes, Amir (2016) rstWeb - A Browser-based Annotation Interface for Rhetorical Structure Theory and Discourse Relations. In: Proceedings of NAACL-HLT 2016 System Demonstrations. San Diego, CA, 1-5.
- Horsmann, Tobias, Erbs, Nicolai and Zesch, Torsten (2016) Fast or Accurate? – A Comparative Evaluation of PoS Tagging Models. In: Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology. Duisburg, Germany, 22-30.
- Zeldes, Amir and Zhang, Shuo (2016) When Annotation Schemes Change Rules Help: A Configurable Approach to Coreference Resolution beyond OntoNotes. In: Proceedings of the NAACL2016 Workshop on Coreference Resolution Beyond OntoNotes (CORBON). San Diego, CA, 92-101.
- Wojatzki, Michael, Melamud, Oren and Zesch, Torsten (2016) Bundled Gap Filling: A New Paradigm for Unambiguous Cloze Exercises. In: Proceedings of the Building Educational Applications Workshop at NAACL 2016. San Diego, CA, 172-181.
- Plank, Barbara (2016) What to do about non-standard (or non-canonical) language in NLP. In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016). Bochum, Germany, 13-20.
- Meyer, Niklas, Wojatzki, Michael and Zesch, Torsten (2016) Validating Bundled Gap Filling – Empirical Evidence for Ambiguity Reduction and Language Proficiency Testing Capabilities. In: Proceedings of NLP4CALL at SLTC 2016. Umeå, Sweden.
- Horsmann, Tobias and Zesch, Torsten (2016) Assigning Fine-grained PoS Tags based on High-precision Coarse-grained Tagging. In: Proceedings of COLING 2016. Osaka, 328-336.
- Krause, Thomas, Leser, Ulf and Lüdeling, Anke (2016) graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora. Journal for Language Technology and Computational Linguistics 31(1), 1-25.
- Zeldes, Amir (2017) The GUM Corpus: Creating Multilayer Resources in the Classroom. Language Resources and Evaluation 51(3), 581-612. (this is the reference paper for citing the corpus)
- Zeldes, Amir (2017) A Distributional View of Discourse Encapsulation: Multifactorial Prediction of Coreference Density in RST. In: 6th Workshop on Recent Advances in RST and Related Formalisms at INLG. Santiago de Compostela, Spain.
- Peng, Siyao and Zeldes, Amir (2018) All Roads Lead to UD: Converting Stanford and Penn Parses to English Universal Dependencies with Multilayer Annotations. In: Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018) at COLING2018. Santa Fe, NM, 167-177.
- Rodriguez, Juan Diego, Caldwell, Adam and Liu, Alexander (2018) Transfer Learning for Entity Recognition of Novel Classes. In: Proceedings of COLING 2018. Santa Fe, NM, 1974-1985.
- Zeldes, Amir (2018) A Multi-Dimensional Analysis of RST Discourse Relations in Eight Genres. In: 14th American Association of Corpus Linguistics Conference (AACL 2018). Atlanta, GA.
- Peng, Siyao and Zeldes, Amir (2018) Validating and Merging a Growing Multilayer Corpus – the Case of GUM. In: 14th American Association of Corpus Linguistics Conference (AACL 2018). Atlanta, GA.
- Prange, Jakob, Schneider, Nathan and Abend, Omri (2019) Semantically Constrained Multilayer Annotation: The Case of Coreference. In: First International Workshop on Designing Meaning Representations (DMR). Florence, Italy.
- Yan, Jianwei and Liu, Haitao (2019) Which annotation scheme is more expedient to measure syntactic difficulty and cognitive demand? In: Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019). Paris, France, 16-24.
- Muller, Philippe, Braud, Chloé and Morey, Mathieu (2019) ToNy: Contextual embeddings for accurate multilingual discourse segmentation of full documents. In: Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019. Minneapolis, MN, 115-124.
- Gessler, Luke, Peng, Siyao, Liu, Yang, Zhu, Yilun, Behzad, Shabnam and Zeldes, Amir (2020) AMALGUM – A Free, Balanced, Multilayer English Web Corpus. In: Proceedings of LREC 2020. Marseille, France, 5267-5275.
- Behzad, Shabnam and Zeldes, Amir (2020) A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging. In: Proceedings of the 12th Web as Corpus Workshop (WAC-XII). Marseille, France, 50-56.
- Sanguinetti, Manuela, Bosco, Cristina, Cassidy, Lauren, Çetinoğlu, Özlem, Cignarella, Alessandra Teresa, Lynn, Teresa, Rehbein, Ines, Ruppenhofer, Josef, Seddah, Djamé and Zeldes, Amir (2020) Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies. In: Proceedings of LREC 2020. Marseille, France, 5240-5250.
- Hou, Yutai, Che, Wanxiang, Lai, Yongkui, Zhou, Zhihan, Liu, Yijia, Liu, Han and Liu, Ting (2020) Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network. In: Proceedings of ACL 2020. Seattle, WA.
- Lan, Ouyu, Huang, Xiao, Lin, Bill Yuchen, Jiang, He, Liu, Liyuan and Ren, Xiang (2020) Learning to Contextually Aggregate Multi-Source Supervision for Sequence Labeling. In: Proceedings of ACL 2020, 2134-2146.
For other research citing GUM, see also the Semantic Scholar entry for the reference paper.