Annotations

Individual annotation layer documentation

The GUM corpus contains a large number of concurrent annotations which can be grouped into 'layers'. Each layer is structurally independent of other layers, and is often created using different tools and at different times, though the build-bot used to correct the corpus (see Corrections) enforces some consistency between layers (for example, the constituent syntax and dependency syntax layers use the same sentence boundaries). The following layers are currently included in the corpus:

  • tok - multiple parts of speech, segmentation and lemmatization for each token
  • tei - document structure, links, ISO date/time, sentence types, errors and more
  • const - Penn Treebank-style trees, including phrase function labels
  • dep - Universal Dependencies (UD) trees
  • edep - Enhanced dependency graphs
  • morph - morphological categories based on the UD inventory
  • ref - nested, named and non-named entity types, coreference, information status and Wikification
  • bridge - bridging anaphora and split antecedent coreference
  • rst - discourse parses in Rhetorical Structure Theory++
  • rsd - dependency version of discourse annotations
  • meta - metadata and document level annotations
Guidelines for these annotations generally follow accepted standards, and GUM is part of and conforms to initiatives including Universal Dependencies and Universal Anaphora. Individual guidelines for specific annotations and cases arising from the annotation of GUM genres in particular are documented in our Wiki.

tok - token annotations

Each token in the GUM corpus is manually checked for correct segmentation and manually tagged using the Penn Tagset with TreeTagger extensions (e.g. distinguishing lexical verbs as VV* from auxiliaries VB* and VH*; see here for details, and the original PTB tagging guidelines without extended tags here). The tokens are automatically lemmatized using Stanza and manually corrected. A second, automatic part of speech tag using the CLAWS5 tag set is then added, as well as original Penn Treebank tags. This phase of the annotation is done using the GitDox annotation interface.

  • pos - part of speech tags in the Penn/TreeTagger tag set
  • xpos - part of speech tags in the original Penn Treebank tag set
  • upos - Google universal part of speech tags
  • claws5 - alternate part of speech tag using the CLAWS5 tag set
  • lemma - lemma (dictionary entry) for each token
  • mseg - morphological segmentation (e.g. un-believ-able)
  • tok_func - convenience annotation giving the token's dependency function
For tokenization and tagging guidelines specific to GUM, see our guidelines on segmentation and tagging & lemmatization.

tei - text encoding initiative

The tei layer contains a variety of information relating to document structure and appearance, following the TEI P5 guidelines. Most annotations relate to formatting, but some relate to contents (e.g. date annotations) and coarse linguistic features (e.g. basic sentence spans, and non-normative/erroneous language using the <sic> tag and the @ana attribute with a corrected target hypothesis). The following list of annotations gives an overview and some notes, and guidelines can be found here:

  • caption - caption for images in the text
  • cell - a table cell
  • date - date expressions
  • date_from - starting date for a range of dates
  • date_notAfter - latest possible date for an inexact date
  • date_notBefore - earliest possible date for an inexact date
  • date_rend - formatting of a date expression (e.g. italics, color)
  • date_to - end date for a range of dates
  • date_when - date in question, normalized to the format yyyy-mm-dd
  • figure - marks the position of a figure in the text
  • figure_rend - a description of the appearance of the figure
  • foreign_xml_lang - ISO code for language of non-English words
  • gap - a gap in the text (e.g. ellipsis marked by an editor)
  • gap_reason - reason for a gap (e.g. 'omitted')
  • head - marks a heading
  • head_rend - a description of the appearance of the heading (e.g. bold)
  • hi_rend - a highlighted section with a description of its appearance (e.g. color)
  • incident_who - an extralinguistic incident (e.g. coughing), and the person responsible
  • item - item or bullet point in a list
  • item_n - item number
  • l_n - a line in poetry with its number
  • lg_n - a line group with the group's number
  • lg_type - line group type (e.g. stanza)
  • list - list of bullet points
  • list_type - a list type (e.g. bulleted, ordered, etc.)
  • note - a footnote or endnote
  • note_n - the number of a footnote
  • note_place - location of the note, e.g. 'foot'
  • p - a paragraph
  • p_rend - a description of the appearance of the paragraph
  • quote - a quotation
  • ref - an external reference, usually a hyperlink
  • ref_target - the target of the reference (usually a URL, if not omitted)
  • row - a table row
  • s - a main sentence span
  • s_type - the sentence mood / rough speech act (declarative, subjunctive, imperative, question, etc.)
  • sic - a section containing an apparent language error, thus in the original
  • sic_ana - a corresponding reconstructed target hypothesis in standard English
  • sp_who - a section uttered by a particular speaker with a reference to the speaker
  • sp_whom - a section uttered with a particular speaker as an addressee
  • table - a table containing the text
  • table_cols - number of columns in a table
  • table_rend - rendering information for a table, e.g. 'boxed'
  • table_rows - number of rows in a table
  • w - a fused word form encompassing more than one token (e.g. can|not)
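
As a rough illustration of how the date, sentence type and error annotations listed above might be read, the following sketch parses a small TEI-like fragment with Python's standard library. The fragment itself, its attribute values and the function name extract_tei_annotations are assumptions invented for this example and are not taken from GUM; only the use of a normalized when value on dates and of <sic> with an ana attribute comes from the descriptions above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment, invented for illustration (not taken from GUM).
SAMPLE = """<text>
<p>
<s type="decl">We met on <date when="2019-05-01">May 1st</date> at <sic ana="their">there</sic> house .</s>
</p>
</text>"""


def extract_tei_annotations(xml_string):
    """Collect dates, error corrections and sentence types from a TEI-like string."""
    root = ET.fromstring(xml_string)
    # date_when: surface text paired with its normalized yyyy-mm-dd value
    dates = [(d.text, d.get("when")) for d in root.iter("date")]
    # sic / sic_ana: original erroneous form paired with its target hypothesis
    corrections = [(s.text, s.get("ana")) for s in root.iter("sic")]
    # s_type: the type attribute of each sentence span
    sentence_types = [s.get("type") for s in root.iter("s")]
    return dates, corrections, sentence_types


dates, corrections, sentence_types = extract_tei_annotations(SAMPLE)
print(dates)
print(corrections)
print(sentence_types)
```

Real documents may nest these elements differently or carry further attributes, so this is only a starting point for exploring the layer.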

const - constituent trees

The const layer contains constituent syntax trees, with some function labels included on edges between constituents. This layer was produced using the Neural Adobe-UCSD Parser based on the gold POS tags and, aside from 8 test documents, has not been fully corrected yet. Function labels for the constituents, such as "NP-SBJ", have been added automatically using a projection algorithm relying on the gold standard syntactic dependency labels from the dep layer.

  • cat - syntactic category of the phrase (e.g. cat="NP")
  • func - grammatical function with respect to parent (e.g. cat >[func="MNR"] cat for manner adverbial modifier phrases)
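
The relationship between a combined label such as "NP-SBJ" and the separate cat and func annotations can be sketched as follows. This is a minimal illustration that assumes Penn Treebank-style bracketed trees with a single function label per phrase; the example tree and the helper names split_label and phrase_labels are invented here and do not come from the GUM tools.

```python
import re

# Hypothetical bracketed tree, invented for illustration (not taken from GUM).
TREE = "(S (NP-SBJ (PRP I)) (VP (VBP agree) (ADVP-MNR (RB fully))))"


def split_label(label):
    """Split a node label like 'NP-SBJ' into (cat, func); func is None if absent."""
    cat, sep, func = label.partition("-")
    return cat, (func if sep else None)


def phrase_labels(tree):
    """Return labels of phrase nodes, i.e. labels directly followed by another '('."""
    return re.findall(r"\(([^\s()]+)\s+(?=\()", tree)


labels = [split_label(label) for label in phrase_labels(TREE)]
print(labels)
```

Labels carrying several function tags or indices would need a more careful split than the single partition used here.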

dep - dependency trees

The dep layer gives a dependency syntax analysis according to Universal Dependencies. This layer is initially produced using Stanza operating on gold tokens and POS tags, and is then manually corrected using the Arborator collaborative syntax annotation software. We follow general UD guidelines, and specific instructions for constructions found in our data are documented in our guidelines.

  • dep - a dependency relation between two tokens
  • func - the universal dependency function according to the UD guidelines

edep - enhanced dependencies

The edep layer adds an enhanced graph representation with structure sharing, which more closely reflects semantic argument structure (see the guidelines). This layer is produced semi-automatically by propagating structure sharing across coordination, subject and object control and more, and is then adjusted, including through the introduction of 'virtual' tokens to cover ellipsis, gapping, right-node-raising and related phenomena. This layer also provides augmented label types including lexical subtypes, such as obl:on to indicate an oblique PP modifier headed by 'on', or conj:or to indicate a disjunction marked by 'or'.

  • edep - an enhanced dependency edge between two tokens (incl. multiple edges per token)
  • func - the enhanced dependency function according to the UD guidelines
  • Ellipsis - a virtual token node representing an elided token with an argument structure role
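
To make the augmented labels more concrete, the sketch below separates labels such as obl:on into a base relation and a subtype. It assumes that the edges are serialized in the standard CoNLL-U DEPS notation (head:relation pairs joined by '|', with decimal IDs such as 5.1 for virtual tokens), which is an assumption about the file format rather than something stated above; the function name parse_deps and the sample values are invented for illustration.

```python
def parse_deps(deps):
    """Parse a DEPS value like '2:nsubj|4:obl:on' into (head, base, subtype) triples."""
    edges = []
    if deps == "_":  # '_' marks an empty value in CoNLL-U
        return edges
    for item in deps.split("|"):
        # The head ID stays a string, since virtual tokens use decimal IDs like '5.1'.
        head, _, label = item.partition(":")
        base, sep, subtype = label.partition(":")
        edges.append((head, base, subtype if sep else None))
    return edges


print(parse_deps("2:nsubj|4:obl:on"))
print(parse_deps("5.1:conj:or"))
```

Note that the split is purely formal: it cannot tell a lexical subtype like 'on' apart from a non-lexical subtype, so any such distinction would have to be made separately.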

morph - universal morphological features

The morph layer represents basic inflectional categories, such as Person, Number, Tense, Mood and more. It is produced using a DepEdit script from the gold parses of the data, and follows Universal Dependencies standards.

  • Abbr - abbreviation, "Yes"
  • Definite - definiteness, e.g. "Def"
  • Degree - adjective/adverb degree, e.g. "Sup" for superlative
  • Gender - grammatical gender, e.g. "Fem"
  • Mood - grammatical mood, e.g. "Ind"
  • Number - grammatical number, e.g. "Sing"
  • NumForm - orthographic form, e.g. "Roman"
  • NumType - type of number, e.g. "Card"
  • Person - grammatical person, e.g. "3"
  • Polarity - negative polarity, "Neg"
  • Poss - possessiveness, "Yes"
  • PronType - pronoun type, e.g. "Prs"
  • Reflex - reflexivity, "Yes"
  • Tense - grammatical tense, e.g. "Past"
  • VerbForm - verb form, e.g. "Fin"
  • Cxn - hierarchical Construction Grammar label, e.g. "Condition-Unrealistic-Inverted"
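
The categories above can be read into a simple lookup structure, as in the following sketch. It assumes that the features appear in the usual UD FEATS notation (Name=Value pairs joined by '|', with '_' for no features); the function name parse_feats and the sample string are invented for illustration.

```python
def parse_feats(feats):
    """Parse a FEATS value like 'Number=Sing|Person=3' into a dict."""
    if not feats or feats == "_":  # '_' marks a token without features
        return {}
    # Split on the first '=' only, so values are kept intact.
    return dict(pair.split("=", 1) for pair in feats.split("|"))


features = parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin")
print(features["Tense"])
print(features.get("Gender"))  # absent features simply yield None
```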

ref - discourse referents and coreference

The ref layer contains information about discourse referents, including their information status (discourse new, given:active, given:inactive, accessible:inferrable, accessible:commonground, and accessible:aggregate for split antecedents), salience (salient or non-salient), and the type of entity they represent (a subset of the OntoNotes scheme including person, object, abstract, and more; see guidelines). Named entities, including their pronominal and non-named mentions, are also linked to their Wikipedia identifier, provided they have a Wikipedia article. The ref layer also includes typed coreference edges between mentions of entities (including nested, non-named and pronominal mentions), distinguishing ana[phora], cata[phora], appos[ition] and other types of coref[erence]. All annotations are reviewed manually and corrected in the GitDox interface's Spannotator extension.

  • entity - entity type
  • identity - Wikification identifier
  • infstat - information status (giv[en]-act/inact, acc[essible]:inf/com/aggr, or new)
  • salience - whether the mention belongs to a salient entity (only in .tsv format; see metadata in conllu)
  • coref - a coreference edge (AQL: entity ->coref entity)
  • type - coreference edge type annotation (ana[phora], cata[phora], appos[ition], disc[ourse], pred[ication], coref[erence])

bridge - bridging relations

The bridge layer contains information about discourse referents which are introduced indirectly through a previous mention of a different entity, which would lead one to anticipate the existence of the novel but accessible entity (see guidelines). These include aggr[egate] mentions (i.e. split antecedents, joining previously separately mentioned entities as 'they'), def[inite] anaphoric bridging (e.g. whole + definite part in "a car ... the wheels"), and 'other' types of bridging.

  • bridge - a bridging edge (AQL: entity ->bridge entity)
  • type - bridging edge type annotation (bridge:aggr[egate], bridge:def[inite], bridge:other)

rst - rhetorical structure

The rst layer provides an analysis of the text in eRST, an enhanced version of Rhetorical Structure Theory, using a set of 32 rhetorical relations, arranged at two hierarchical levels. Each segment of the text, which may be a sentence, clause or other unit, is integrated into a tree of utterances forming the rhetorical structure of the document. Segmentation guidelines are identical to the guidelines for the RST Discourse Treebank, and structuring and discourse relation guidelines can be found here. Trees are augmented with tree-breaking secondary edges where needed, and relations point to categorized signals indicating how the relations may be recognized based on properties of the text, including via discourse markers (connectives like 'but' or 'because'), punctuation, morphology, layout, coreference and other means. Analyses are created using rstWeb.

  • (node) kind - for dominating structures (single segment span or group of segments)
  • (node) type - for group structures (simple span or multinuc)
  • (edge) type - rst edge type (rst relation or multinuc relation)
  • (edge) end - distinguishes 'source' and 'target' nodes for tree-breaking secondary edges
  • relname - rst relation name (elaboration-additional, explanation-evidence, etc.; see guidelines)
  • signal_type - a major signaling device type, e.g. 'dm', 'syntactic', 'semantic' etc.
  • signal_subtype - a signal subtype such as 'reported_speech' (syntactic), 'indicative_phrase' (lexical), etc.
  • signal_text - text in the span of a signal
  • signaled_relation - the relation type belonging to the signal
  • (edge) signal_token - the relation between a signaled relation and the signaling token

rsd - rhetorical structure dependencies

The rsd layer gives a dependency conversion of the RST layer, using only the discourse segments and no non-terminal grouping spans or coordinate structures. Discourse units are enriched automatically with a number of annotations.

  • rsd_rel - the discourse relation that the span heads
  • head_func - root syntactic dependency function of the span
  • head_pos - root UPOS tag of the span
  • head_tok - root word form of the span
  • len - span length in tokens
  • pos1 - first UPOS tag in the span
  • stype - the sentence type containing the span
  • subord - the direction of syntactic subordination of the span (LEFT, RIGHT or NONE)
  • func - dependency relation name as an edge annotation, ends in _m for multinuclear relations, _r otherwise
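
The suffix convention for func described above can be unpacked with a few lines of Python. This is a minimal sketch: the function name split_rsd_func is invented here, and the sample label 'joint-list_m' is a hypothetical multinuclear label used only for illustration.

```python
def split_rsd_func(func):
    """Return (relation, is_multinuclear) for an edge label such as 'joint-list_m'."""
    if func.endswith("_m"):
        return func[:-2], True   # multinuclear relation
    if func.endswith("_r"):
        return func[:-2], False  # all other relations
    # Labels without either suffix are returned unchanged, with an unknown status.
    return func, None


print(split_rsd_func("elaboration-additional_r"))
print(split_rsd_func("joint-list_m"))
```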

meta - metadata and document level annotations

Each document has metadata indicating provenance, document creation time and speaker information, as well as document-level annotations, such as a one-sentence summary of the text constructed according to the guidelines.

  • author - document author(s) or other appropriate attribution source
  • dateCollected - date when contents were collected from the source
  • dateCreated - earliest known date when the source existed
  • dateModified - date of the last known modification of the source data before collection
  • salientEntities - a comma-separated list of unique CoNLL-U IDs for the most salient entities in the document
  • shortTitle - a unique one-word title representing the document
  • sourceURL - link to the document's original location
  • speakerCount - number of speakers (0 for a written text with no speakers)
  • speakerList - list of speaker IDs used in the annotation (or 'none' if 0 speakers)
  • summary - a one-sentence summary according to the guidelines
  • title - original title at the source of the document (full article title, video title etc.)
  • type - GUM text type or genre (bio, news, vlog etc.)