Publications and citation information

License and attribution information

GUM is made available under a Creative Commons license in keeping with the underlying texts. The documents from Wikimedia (Wikinews, including interviews, and Wikivoyage) are available under a CC-BY attribution license, as are academic articles, Wikipedia biographies, OpenStax textbooks and YouTube vlogs (retrieved using YouTube's Creative Commons filtered search). Some of the political speeches included in the corpus did not specify exact licenses, but are made available by official government and UN websites which indicate that these speeches are in the public domain, and not subject to copyright. Conversations from the Santa Barbara Corpus have been made available for annotation in GUM under the CC-BY license, courtesy of Jack DuBois (UCSB).

However, please note that wikiHow texts and fiction texts are made available under a CC-BY-NC-SA license (non-commercial, share alike), meaning that commercial and/or non-open source use of those texts is prohibited. Data from reddit forum discussions is not distributed with the corpus, but can be obtained using a script under the licensing conditions imposed by reddit. When using the data, please make sure to cite the sources of the texts as required by their source sites, and give credit to the GUM annotators, who are listed below, for the annotated data.

As a scholarly citation for the corpus in articles, please use this paper:

@article{Zeldes2017,
   author    = {Amir Zeldes},
   title     = {The {GUM} Corpus: Creating Multilayer Resources in the Classroom},
   journal   = {Language Resources and Evaluation},
   year      = {2017},
   volume    = {51},
   number    = {3},
   pages     = {581--612},
   doi       = {}
}

Papers using GUM

This is a (non-exhaustive) list of papers using the GUM corpus. Feel free to let us know if you know of more:

For other research citing GUM, see also the Semantic Scholar entry for the reference paper.