GUM - The Georgetown University Multilayer Corpus


New: Version 3.1.0 (2017-05-01) - revised and corrected version of GUM 3 with data from 2016!

GUM is an open source multilayer corpus of richly annotated web texts from four text types. The corpus is collected and expanded by students as part of the curriculum in LING-367 Computational Corpus Linguistics at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (Creative Commons licenses), so that new texts can be annotated and published with ease. The corpus currently contains the following proportions of texts:

text type                     source        texts   tokens
Interviews (conversational)   Wikinews         19   18,037
News (narrative)              Wikinews         21   14,093
Travel guides (informative)   Wikivoyage       17   14,955
How-tos (instructional)       wikiHow          19   16,920
Total                                          76   64,005

All documents are annotated with a range of annotation layers, most of which are produced or corrected manually. Layers include document structure and markup (TEI), part of speech tags and lemmas, constituent syntax, dependency syntax, entity types, information status and coreference, and rhetorical structure (RST), each of which is described below.

The corpus is available for download as a whole or as individual annotation layers in a variety of formats, and can be searched online in ANNIS using the ANNIS Query Language (AQL).

Example queries
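
The queries below are illustrative AQL sketches, using the annotation names documented in the sections that follow; the specific values shown (tags, entity types) are examples only and may need adjusting:

  pos="NN"                                   tokens tagged as singular common nouns
  lemma="go" _=_ pos="VVG"                   gerund/participle forms of the lexical verb go
  cat >[func="TMP"] cat                      a phrase dominating a temporal modifier phrase
  entity="person" ->coref entity="person"    coreferent mentions of two person entities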

Annotation layers

token annotations

Each token in the GUM corpus is manually checked for correct segmentation and manually tagged using the Penn tag set with TreeTagger extensions (e.g. distinguishing lexical verbs, VV*, from the auxiliaries VB* and VH*; see here for details). Tokens are automatically lemmatized using the TreeTagger and then manually corrected; an alternate automatic part of speech tag using the CLAWS5 tag set is also added, as well as the original Penn Treebank tags.

pos         part of speech tag in the Penn/TreeTagger tag set
penn_pos    part of speech tag in the original Penn Treebank tag set
claws5      alternate part of speech tag using the CLAWS5 tag set
lemma       lemma (dictionary entry) for each token
tok_func    convenience annotation giving the token's dependency function
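
As a small illustration (the lemma and tag values here are examples only), the following AQL query would find tokens of auxiliary have, i.e. tokens whose lemma is have and whose TreeTagger tag begins with VH:

  lemma="have" _=_ pos=/VH.*/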

tei - text encoding initiative

The tei layer contains a variety of information relating to document structure and appearance, following the TEI P5 guidelines. Most annotations relate to formatting, but some relate to content (e.g. date annotations) and coarse linguistic features (e.g. basic sentence spans, and non-normative or erroneous language marked with the <sic> tag). The following list of annotations gives an overview and some notes:

caption          caption for images in the text
date             date expressions
date_from        starting date for a range of dates
date_notAfter    latest possible date for an inexact date
date_notBefore   earliest possible date for an inexact date
date_rend        formatting of a date expression (e.g. italics, color)
date_to          end date for a range of dates
date_when        date in question, normalized to the format yyyy-mm-dd
figure           marks the position of a figure in the text
figure_rend      a description of the appearance of the figure
head             marks a heading
head_rend        a description of the appearance of the heading (e.g. bold)
hi_rend          a highlighted section with a description of its appearance (e.g. color)
incident_who     an extralinguistic incident (e.g. coughing), and the person responsible
item             item or bullet point in a list
item_n           item number
l_n              a line in poetry with its number
lg_n             a line group with the group's number
lg_type          line group type (e.g. stanza)
list             list of bullet points
list_type        a list type (e.g. bulleted, ordered, etc.)
p                a paragraph
p_rend           a description of the appearance of the paragraph
quote            a quotation
ref              an external reference, usually a hyperlink
ref_target       the target of the reference (usually a URL, if not omitted)
s                a main sentence span
s_type           the sentence mood / rough speech act (declarative, subjunctive, imperative, question...)
sic              a section containing an apparent language error, reproduced as in the original
sp_who           a section uttered by a particular speaker, with a reference to the speaker
w                a fused word form encompassing more than one token (e.g. can|not)
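
These span annotations can be combined with token annotations in AQL using overlap operators; for example, the following illustrative query (annotation names as above, the tag value is an example) finds headings containing a proper noun token:

  head _i_ pos="NNP"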

Some additional annotations originally included in the TEI markup were removed because of consistency problems. If you are interested in obtaining a dataset containing these, please contact us as instructed in the download section. Currently available for version 1 data, but not included by default, are the div section annotations:

div1         major, top level section
div1_n       section number for div1
div1_type    type of section (e.g. section, chapter, etc.)
div2         same as div1 for a second level nested section
div2_n       (as div1_n)
div2_type    (as div1_type)
div3         same as div1 for a third level nested section
div3_n       (as div1_n)

const - constituent trees

The const layer contains constituent syntax trees, with some function labels included on edges between constituents. This layer was produced using the Stanford Parser and has not been fully corrected yet.

cat     syntactic category of the phrase (e.g. cat="NP")
func    grammatical function with respect to parent (e.g. cat >[func="TMP"] cat for temporal modifier phrases)
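
As a further sketch (the category values are illustrative), the following AQL query finds a noun phrase directly dominating a prepositional phrase in the constituent trees:

  cat="NP" > cat="PP"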

dep - dependency trees

The dep layer gives a dependency syntax analysis according to the Stanford Dependencies manual. This layer was initially produced using the Stanford Parser and then manually corrected using the Arborator collaborative syntax annotation software. For the annotation project we used non-collapsed dependencies, and dependencies for punctuation tokens have been removed.

dep     a dependency relation between two tokens
func    the dependency function according to the Stanford manual (e.g. "nsubj")
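
For instance, assuming that dependency edges point from the governing token to its dependent, the following sketch of an AQL query (tag values illustrative) finds a lexical verb with a nominal subject:

  pos=/VV.*/ ->dep[func="nsubj"] pos=/NN.*/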

ref - discourse referents and coreference

The ref layer contains information about discourse referents, including their information-structural status (giv[en], acc[essible] or new) and the type of entity they represent (a subset of the OntoNotes scheme including person, object, abstract, and more; see guidelines). The layer also includes typed coreference edges between mentions of entities, distinguishing ana[phora], cata[phora], appos[ition], bridge (bridging relationships) and other types of coref[erence].

entity     entity type
infstat    information status (giv[en], acc[essible] or new)
coref      a coreference edge (AQL: entity ->coref entity)
type       coreference edge type annotation (ana[phora], cata[phora], appos[ition], bridge, coref[erence])
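
Combining these, an illustrative AQL query (entity and type values as documented above) for an anaphoric coreference link between two person mentions might look like this:

  entity="person" ->coref[type="ana"] entity="person"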

rst - rhetorical structure

The rst layer provides an analysis of the text in Rhetorical Structure Theory, using a small set of 20 rhetorical relations. Each segment of the text, which may be a sentence, clause or other unit, is integrated into a tree of utterances forming the rhetorical structure of the document. Analyses are created using rstWeb.

(node) kind    for dominating structures (single segment span or group of segments)
(node) type    for group structures (simple span or multinuc)
(edge) type    rst edge type (rst relation or multinuc relation)
relname        rst relation name (elaboration, evidence, etc.; see guidelines)
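
Assuming that rhetorical relations are represented as dominance edges carrying the relname annotation (a sketch only; the relation name is just an example), an AQL query for a unit elaborated on by another unit could look like:

  node >[relname="elaboration"] node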

Correct and rebuild

If you notice errors you can contribute corrections to GUM, and you can also rebuild the merged and enriched corpus from the source files. See Contributing and Running the GUM Build Bot.

Papers using GUM

This is a non-exhaustive list of papers using the GUM corpus; feel free to let us know of others:

Download

You may download the entire corpus or separate annotation layers in the following formats. If you are interested in other subsets or formats of the data, please contact Amir Zeldes.

Format                               Annotations
relANNIS 3.3                         all (merged), for search with ANNIS
PAULA XML                            all (merged), in standoff XML
TreeTagger/CWB/CQPWeb XML            token annotations and TEI, including sentence types and speakers
Penn style brackets                  tokens, pos and const/cat
CoNLL10 dependencies                 tokens, pos, lemma and dep/func, sentence types and speakers
CoNLL coreference format             untyped coreference, excluding bridging relations
WebAnno TSV3 format                  typed coreference, including bridging, entities and information structure
Rhetorical Structure Theory (rs3)    untokenized text with RST analysis

License and attribution information

GUM is made available under Creative Commons licenses in keeping with the underlying texts. The documents from Wikimedia (Wikinews, including interviews, and Wikivoyage) are available under a CC-BY attribution license. However, wikiHow texts are made available under a CC-BY-NC-SA license (non-commercial, share-alike), meaning that commercial and/or non-open source use of those texts is prohibited. When using the data, please make sure to cite the sources of the texts as required by their source sites, and give credit to the GUM annotators, who are listed below, for the annotated data.

As a scholarly citation for the corpus in articles, please use this paper:

GUM annotation team (so far - thanks for participating!)

...and other annotators who wish to remain anonymous!