GUM - The Georgetown University Multilayer Corpus

New: GUM 5.0.0 is out!

GUM is an open source multilayer corpus of richly annotated web texts from eight text types. The corpus is collected and expanded by students as part of the curriculum in LING-367 Computational Corpus Linguistics at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (mostly Creative Commons licenses), so that new texts can be annotated and published with ease. The corpus currently contains the following proportions of texts:

text type          source      texts  tokens
News stories       Wikinews       21  14,093
Travel guides      Wikivoyage     17  14,955
How-to guides      wikiHow        19  16,920
Academic writing   Various        12  10,966
Forum discussions  reddit         12  10,526

All documents are annotated with a range of annotation layers, most of which are produced or corrected manually. The individual layers are described under Annotation layers below.

The corpus is available for download in several formats, including individual annotation layers in various formats, and can be searched online in ANNIS using the ANNIS query language (AQL).

Example queries

Annotation layers

token annotations

Each token in the GUM corpus is manually checked for correct segmentation and manually tagged using the Penn tag set with TreeTagger extensions (e.g. distinguishing lexical verbs as VV* from the auxiliaries VB* and VH*). Tokens are lemmatized automatically using the TreeTagger and then manually corrected. A second, automatically assigned part of speech tag from the CLAWS5 tag set is added, as well as original Penn Treebank tags.

pos       part of speech tags in the Penn/TreeTagger tag set
penn_pos  part of speech tags in the original Penn Treebank tag set
claws5    alternate part of speech tag using the CLAWS5 tag set
lemma     lemma (dictionary entry) for each token
tok_func  convenience annotation giving the token's dependency function
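
As a rough illustration of the VV*/VB*/VH* distinction in the pos annotation, the following Python sketch sorts tags into lexical verbs and forms of be and have by their two-letter prefix. The function name, the returned labels and the example tokens are hypothetical. The sketch only applies to the TreeTagger-style pos tags, not to penn_pos, where VB* covers all verbs.

```python
def verb_class(pos: str) -> str:
    """Classify a TreeTagger-style tag by its two-letter verb prefix."""
    if pos.startswith("VV"):
        return "lexical"
    if pos.startswith("VB"):
        return "be"
    if pos.startswith("VH"):
        return "have"
    return "other"


# Hypothetical (token, pos) pairs, for illustration only
tagged = [("has", "VHZ"), ("been", "VBN"), ("running", "VVG"), ("fast", "RB")]
classes = [(token, verb_class(pos)) for token, pos in tagged]
print(classes)
```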

tei - text encoding initiative

The tei layer contains a variety of information relating to document structure and appearance, following the TEI P5 guidelines. Most annotations relate to formatting, but some relate to content (e.g. date annotations) and coarse linguistic features (e.g. basic sentence spans, and non-normative or erroneous language marked with the <sic> tag). The following list of annotations gives an overview and some notes:

caption         caption for images in the text
cell            a table cell
date            date expressions
date_from       starting date for a range of dates
date_notAfter   latest possible date for an inexact date
date_notBefore  earliest possible date for an inexact date
date_rend       formatting of a date expression (e.g. italics, color)
date_to         end date for a range of dates
date_when       date in question, normalized to the format yyyy-mm-dd
figure          marks the position of a figure in the text
figure_rend     a description of the appearance of the figure
gap             a gap in the text (e.g. ellipsis marked by an editor)
gap_reason      reason for a gap (e.g. 'omitted')
head            marks a heading
head_rend       a description of the appearance of the heading (e.g. bold)
hi_rend         a highlighted section with a description of its appearance (e.g. color)
incident_who    an extralinguistic incident (e.g. coughing), and the person responsible
item            item or bullet point in a list
item_n          item number
l_n             a line in poetry with its number
lg_n            a line group with the group's number
lg_type         line group type (e.g. stanza)
list            list of bullet points
list_type       a list type (e.g. bulleted, ordered, etc.)
note            a footnote or endnote
note_n          the number of a footnote
note_place      location of the note, e.g. 'foot'
p               a paragraph
p_rend          a description of the appearance of the paragraph
quote           a quotation
ref             an external reference, usually a hyperlink
ref_target      the target of the reference (usually a URL, if not omitted)
row             a table row
s               a main sentence span
s_type          the sentence mood / rough speech act (declarative, subjunctive, imperative, question, etc.)
sic             a section containing an apparent language error, thus in the original
sp_who          a section uttered by a particular speaker with a reference to the speaker
table           a table containing the text
table_cols      number of columns in a table
table_rend      rendering information for a table, e.g. 'boxed'
table_rows      number of rows in a table
w               a fused word form encompassing more than one token (e.g. can|not)
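
The annotation names in this list appear to combine a TEI element name with an attribute name (e.g. date_when for the when attribute of a date element). Under that assumption, the Python sketch below derives such names from a small XML fragment. The fragment, its attribute values and the function name are invented for illustration and are not taken from the corpus files.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-like fragment; the text and attribute values are invented
SAMPLE = """<p rend="bold">
<s type="decl">The event took place on <date when="2012-05-01">May 1</date> .</s>
<s type="imp">See the <hi rend="italic">guide</hi> .</s>
</p>"""


def annotation_names(xml_text: str) -> list[tuple[str, str]]:
    """Return (name, value) pairs, naming each attribute as element_attribute."""
    pairs = []
    for elem in ET.fromstring(xml_text).iter():
        if not elem.attrib:
            # Elements without attributes are listed by their tag alone
            pairs.append((elem.tag, ""))
        for attr, value in elem.attrib.items():
            pairs.append((f"{elem.tag}_{attr}", value))
    return pairs


for name, value in annotation_names(SAMPLE):
    print(name, value)
```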

Some additional annotations originally included in the TEI markup were removed because of consistency problems. If you are interested in obtaining a dataset containing these, please contact us as instructed in the download section. Currently available for version 1 data, but not included by default, are the div section annotations:

div1       major, top level section
div1_n     section number for div1
div1_type  type of section (e.g. section, chapter, etc.)
div2       same as div1 for a second level nested section
div3       same as div1 for a third level nested section

const - constituent trees

The const layer contains constituent syntax trees, with some function labels included on edges between constituents. This layer was produced using the Stanford Parser based on the gold POS tags, and has not been fully corrected yet.

cat   syntactic category of the phrase (e.g. cat="NP")
func  grammatical function with respect to parent (e.g. cat >[func="TMP"] cat for temporal modifier phrases)
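
For readers working with bracketed constituent trees outside of ANNIS, the following Python sketch parses one Penn-style bracketed tree and collects the cat labels of its phrasal nodes. The example tree and the helper names are hypothetical, and the exact layout of the distributed bracket files is an assumption.

```python
import re


def parse_brackets(text: str):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def phrase_labels(tree):
    """Collect labels of nodes that dominate other nodes (i.e. phrases)."""
    label, children = tree
    labels = []
    if any(isinstance(child, tuple) for child in children):
        labels.append(label)
    for child in children:
        if isinstance(child, tuple):
            labels.extend(phrase_labels(child))
    return labels


# Hypothetical tree for illustration only
tree = parse_brackets("(S (NP (NNS Students)) (VP (VVD slept) (NP (DT all) (NN day))))")
print(phrase_labels(tree))
```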

dep - dependency trees

The dep layer gives a dependency syntax analysis according to two schemes: Stanford Dependencies, as described in the Stanford Dependencies manual, and Universal Dependencies (see the ud layer below). This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.

dep   a dependency relation between two tokens
func  the dependency function according to the Stanford manual (e.g. "pobj")
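
The sketch below shows one way to read dependency edges from a ten-column, tab-separated CoNLL-style sentence in Python. The column order (ID, FORM, LEMMA, two POS columns, FEATS, HEAD, DEPREL, and two unused columns) is assumed from the common CoNLL-X layout and may differ from the distributed files. The sample sentence is invented, and the underscore used for the headless punctuation token is likewise an assumption.

```python
from collections import namedtuple

Token = namedtuple("Token", "id form lemma pos head func")

# Hypothetical sentence; columns are joined with tabs below
ROWS = [
    "1 I I PP PP _ 2 nsubj _ _",
    "2 slept sleep VVD VVD _ 0 root _ _",
    "3 on on IN IN _ 2 prep _ _",
    "4 the the DT DT _ 5 det _ _",
    "5 sofa sofa NN NN _ 3 pobj _ _",
    "6 . . SENT SENT _ _ _ _ _",
]
SAMPLE = "\n".join("\t".join(row.split()) for row in ROWS)


def read_sentence(block: str) -> list:
    """Read one sentence block into a list of Token tuples."""
    tokens = []
    for line in block.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        # Tokens without a numeric head (assumed here for punctuation) get None
        head = int(cols[6]) if cols[6].isdigit() else None
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[4], head, cols[7]))
    return tokens


def edges(tokens: list) -> list:
    """Return (head form, func, dependent form) triples, skipping headless tokens."""
    forms = {token.id: token.form for token in tokens}
    forms[0] = "ROOT"
    return [(forms[t.head], t.func, t.form) for t in tokens if t.head is not None]


for edge in edges(read_sentence(SAMPLE)):
    print(edge)
```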

ud - universal dependencies

The ud layer contains the Universal Dependencies version of the same syntax trees as the Stanford dep layer. The Universal Dependencies trees, including fully connected punctuation, are generated automatically from the gold dep files, as are the universal POS (UPOS) annotations.

ud      a universal dependency relation between two tokens
deprel  the universal dependency relation according to the UD guidelines (e.g. "obl")

morph - universal morphological features

The morph layer represents basic inflectional categories, such as Person, Number, Tense, Mood and more. It is produced using CoreNLP from the gold parses of the data and follows Universal Dependencies standards.

Definite  definiteness, e.g. "Def"
Degree    adjective degree, e.g. "Sup" for superlative
Gender    grammatical gender, e.g. "Fem"
Mood      grammatical mood, e.g. "Ind"
Number    grammatical number, e.g. "Sing"
NumType   type of number, e.g. "Card"
Person    grammatical person, e.g. "3"
Polarity  negative polarity, "Neg"
Poss      possessiveness, "Yes"
PronType  pronoun type, e.g. "Prs"
Reflex    reflexivity, "Yes"
Tense     grammatical tense, e.g. "Past"
VerbForm  verb form, e.g. "Fin"
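
In Universal Dependencies data, features like these are conventionally serialized as a single string of Name=Value pairs separated by vertical bars. Assuming the morph annotations are available in that form, a minimal Python sketch for turning such a string into a dictionary could look as follows. The function name and the example feature string are illustrative only.

```python
def parse_feats(feats: str) -> dict[str, str]:
    """Split a UD-style feature string such as 'Number=Sing|Person=3'."""
    if feats in ("", "_"):
        # An underscore conventionally marks a token without features
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hypothetical feature string for a finite past tense verb
feats = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
print(feats["Tense"], feats["VerbForm"])
```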

ref - discourse referents and coreference

The ref layer contains information about discourse referents, including their information status (giv[en], acc[essible] or new) and the type of entity they represent (a subset of the OntoNotes scheme including person, object, abstract, and more; see guidelines). The layer also includes typed coreference edges between mentions of entities, distinguishing ana[phora], cata[phora], appos[ition] and other types of coref[erence].

entity   entity type
infstat  information status (giv[en], acc[essible] or new)
coref    a coreference edge (AQL: entity ->coref entity)
type     coreference edge type annotation (ana[phora], cata[phora], appos[ition], coref[erence])
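
Since coreference is stored as pairwise edges between mentions, recovering whole coreference chains requires grouping mentions that are connected by any sequence of edges. The Python sketch below does this with a small union-find structure. The mention identifiers and the (source, target, type) triple representation are hypothetical and do not reflect a particular file format.

```python
def coref_chains(edges):
    """Group mentions linked by coreference edges into sorted chains."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for source, target, _edge_type in edges:
        # Merge the two groups regardless of the edge type
        parent[find(source)] = find(target)

    chains = {}
    for mention in list(parent):
        chains.setdefault(find(mention), []).append(mention)
    return sorted(sorted(chain) for chain in chains.values())


# Hypothetical typed edges: (source mention, target mention, edge type)
edges = [("m2", "m1", "ana"), ("m3", "m2", "coref"), ("m5", "m4", "appos")]
print(coref_chains(edges))
```

Edge types are ignored when building chains here. A stricter variant could skip appos edges or keep the types alongside each chain.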

bridge - bridging relations

The bridge layer contains information about discourse referents which are introduced indirectly through a previous mention of a different entity, which would lead one to anticipate the existence of the novel but accessible entity. These include aggr[egate] mentions (joining previously separately mentioned entities as 'they'), def[inite] anaphoric bridging (e.g. whole + definite part in "a car ... the wheels"), and 'other' types of bridging.

bridge  a bridging edge (AQL: entity ->bridge entity)
type    bridging edge type annotation (bridge:aggr[egate], bridge:def[inite], bridge:other)

rst - rhetorical structure

The rst layer provides an analysis of the text in Rhetorical Structure Theory, using a small set of 20 rhetorical relations. Each segment of the text, which may be a sentence, clause or other unit, is integrated into a tree of utterances forming the rhetorical structure of the document. Analyses are created using rstWeb.

(node) kind  for dominating structures (single segment span or group of segments)
(node) type  for group structures (simple span or multinuc)
(edge) type  rst edge type (rst relation or multinuc relation)
relname      rst relation name (elaboration, evidence, etc.; see guidelines)
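
To show how these node and edge annotations relate to each other, the Python sketch below reads a small rs3-style XML document and lists each node with its kind, parent, relation name and edge type. The sample document is invented, and its layout (segment and group elements in a body, with relation types declared in a header) is assumed from the rs3 format as commonly produced by RST annotation tools.

```python
import xml.etree.ElementTree as ET

# Hypothetical rs3-style document; ids, relations and text are invented
SAMPLE = """<rst>
  <header>
    <relations>
      <rel name="elaboration" type="rst"/>
      <rel name="joint" type="multinuc"/>
    </relations>
  </header>
  <body>
    <segment id="1" parent="3" relname="span">GUM is a corpus</segment>
    <segment id="2" parent="1" relname="elaboration">built by students</segment>
    <group id="3" type="span"/>
  </body>
</rst>"""


def read_rs3(xml_text: str) -> list:
    """Return one dictionary per segment or group node in document order."""
    root = ET.fromstring(xml_text)
    # Map each declared relation name to its edge type (rst or multinuc)
    rel_types = {rel.get("name"): rel.get("type") for rel in root.iter("rel")}
    nodes = []
    for elem in root.find("body"):
        nodes.append({
            "id": elem.get("id"),
            "kind": elem.tag,
            "parent": elem.get("parent"),
            "relname": elem.get("relname"),
            "edge_type": rel_types.get(elem.get("relname")),
        })
    return nodes


for node in read_rs3(SAMPLE):
    print(node)
```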

Correct and rebuild

You can contribute corrections to GUM if you notice any errors, and you can also rebuild the merged and enriched corpus from the source files. See Contributing and Running the GUM Build Bot.

Papers using GUM

This is a (non-exhaustive) list of papers using the GUM corpus. Feel free to let us know if you know of more:

For other research citing GUM, see also the Semantic Scholar entry for the reference paper.


Download

You may download the entire corpus or separate annotation layers in the following formats. Please make sure to read about reconstructing reddit token data, which is not included in the downloadable version but can be added using a script. If you are interested in other subsets or formats of the data, please contact Amir Zeldes.

relANNIS3.3                        all (merged), for search with ANNIS
PAULA XML                          all (merged), in standoff XML
TreeTagger/CWB/CQPWeb XML          token annotations and TEI, including sentence types and speakers
Penn style brackets                tokens, pos and const/cat
CoNLL10 dependencies               tokens, pos, lemma and dep/func, sentence types and speakers
CoNLL coreference format           untyped coreference, excluding bridging relations
WebAnno TSV3 format                typed coreference, including bridging, entities and information structure
Rhetorical Structure Theory (rs3)  untokenized text with RST analysis

License and attribution information

GUM is made available under a Creative Commons license in keeping with the underlying texts. The documents from Wikimedia (Wikinews, including interviews, and Wikivoyage) are available under a CC-BY attribution license, as are academic articles and Wikipedia biographies. However, wikiHow texts and fiction texts are made available under a CC-BY-NC-SA license (non-commercial, share alike), meaning that commercial and/or non-open source use of those texts is prohibited. Data from reddit forum discussions is not made available with the corpus, but can be obtained using a script under the licensing conditions imposed by reddit. When using the data, please make sure to cite the sources of the texts as required by their source sites, and give credit for the annotated data to the GUM annotators, who are listed below.

As a scholarly citation for the corpus in articles, please use this paper:

   author    = {Amir Zeldes},
   title     = {The {GUM} Corpus: Creating Multilayer Resources in the Classroom},
   journal   = {Language Resources and Evaluation},
   year      = {2017},
   volume    = {51},
   number    = {3},
   pages     = {581--612},
   doi       = {}

GUM annotation team (so far - thanks for participating!)

...and other annotators who wish to remain anonymous!