GUM7 – four added genres, Wikification and more!

The first release of GUM series 7 now adds four more genres to our multilayer corpus, in addition to brand new annotation layers, corrections, and more. This post outlines the main changes and additions to the corpus.

What is GUM?

GUM is an open source multilayer corpus of richly annotated web texts from twelve text types. The corpus is collected and expanded by students as part of the curriculum in LING-367 Computational Corpus Linguistics at Georgetown University. The selection of text types is meant to represent different communicative purposes, while coming from sources that are readily and openly available (mostly Creative Commons licenses), so that new texts can be annotated and published with ease.

New data

Every three years we change the genres that will be expanded in the next edition of GUM. This year we've started to focus more on spoken materials, and the new genres include face-to-face conversations, political speeches, open access textbooks and YouTube vlogs. This is how the data currently breaks down:

Text type           Source        Documents    Tokens
Interviews          Wikinews             19    18,037
News stories        Wikinews             21    14,094
Travel guides       Wikivoyage           17    14,955
How-to guides       wikiHow              19    16,920
Academic writing    Various              16    15,110
Biographies         Wikipedia            20    17,951
Fiction             Various              18    16,307
Forum discussions   reddit               18    16,286
Conversations       UCSB Corpus           5     5,698
Political speeches  Various               5     4,831
CC Vlogs            YouTube               5     5,180
Textbooks           OpenStax              5     5,376
Total                                   168   150,756

Wikification

This year we've added a new annotation type, which we've also retrofitted onto our old annotations: Named Entity Linking, or more specifically Wikification. This adds links to Wikipedia for all named entities that have a corresponding Wikipedia article. Other Wikified datasets typically link only 'atomic' named spans, where a person's name might be linked to Wikipedia. Thanks to our nested coreference annotations, GUM also links all other mentions of a wikified entity, including pronominal and common noun cases, as well as mentions within mentions. This means that a "she" might be linked to French actress Jeanne Moreau in context, and that mentions like the [[United States] Congress] will include links to both entities' Wikipedia pages. Many thanks to Yi-Ju Lin for working on this project!

Figure: Nested Wikification of mentions of Jeanne Moreau, an award she won, and the festival and location where she won it.
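
To make the idea concrete, here is a minimal Python sketch of how a Wikipedia link can spread from one named mention to every other mention in the same coreference chain, and how nested mentions can each keep their own link. The `Mention` class, its field names, the chain IDs and the token offsets are all hypothetical illustrations, not GUM's actual data format or build code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    text: str
    start: int                  # index of first token (inclusive)
    end: int                    # index of last token (inclusive)
    chain: int                  # hypothetical coreference chain ID
    wiki: Optional[str] = None  # Wikipedia article title, if linked


def propagate_links(mentions):
    """Give every mention in a chain the link found on any linked member."""
    chain_link = {}
    for m in mentions:
        if m.wiki is not None:
            chain_link.setdefault(m.chain, m.wiki)
    for m in mentions:
        if m.wiki is None:
            m.wiki = chain_link.get(m.chain)
    return mentions


def is_nested(inner, outer):
    """True if `inner` lies inside `outer` without being the same span."""
    same_span = (inner.start, inner.end) == (outer.start, outer.end)
    return outer.start <= inner.start and inner.end <= outer.end and not same_span


# Invented example data: offsets and chain IDs are for illustration only.
mentions = [
    Mention("Jeanne Moreau", 0, 1, chain=1, wiki="Jeanne_Moreau"),
    Mention("she", 5, 5, chain=1),
    Mention("the United States Congress", 9, 12, chain=2, wiki="United_States_Congress"),
    Mention("United States", 10, 11, chain=3, wiki="United_States"),
]
propagate_links(mentions)
for m in mentions:
    print(m.text, "->", m.wiki)
```

In this sketch the pronoun "she" ends up with the same link as "Jeanne Moreau", while "United States" keeps its own link inside the larger "the United States Congress" mention.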

Totally reworked date/time annotations

Following work by Nitin Venkateswaran, the new release has much more extensive date/time annotations, including normalization for seasons and parts of the day (e.g. annotating any 'summer' as not before the beginning of June and not after the end of August). The new data follows practices in other time-annotated corpora more closely, and will be used to evaluate date/time prediction accuracy for automatically annotated datasets, such as our Amalgum corpus.
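
As a rough illustration of what season normalization could look like, the Python sketch below maps a season name and a year to a "not before" and "not after" date. Only the summer bounds come from the description above. The spring and autumn entries, the table name `SEASON_BOUNDS` and the function `normalize_season` are assumptions made for this example, and the actual GUM annotation scheme may differ.

```python
from datetime import date

# Only the summer bounds are taken from the text above. The other entries are
# illustrative assumptions (northern-hemisphere meteorological seasons).
SEASON_BOUNDS = {
    "spring": ((3, 1), (5, 31)),   # assumption
    "summer": ((6, 1), (8, 31)),   # from the text: June through August
    "autumn": ((9, 1), (11, 30)),  # assumption
}


def normalize_season(season, year):
    """Return (not_before, not_after) dates for a season in a given year."""
    (start_month, start_day), (end_month, end_day) = SEASON_BOUNDS[season.lower()]
    return date(year, start_month, start_day), date(year, end_month, end_day)


not_before, not_after = normalize_season("summer", 2014)
print(not_before.isoformat(), not_after.isoformat())  # 2014-06-01 2014-08-31
```

Winter is left out on purpose: it crosses a year boundary, so a real scheme would need an explicit decision about which year the expression belongs to.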

New SOTA neural constituent parsing and function labels

Constituent trees in GUM are parsed automatically from gold POS tags using a high-accuracy parser, except for a small subset of manually annotated test documents. Recent advances in parsing mean that we can now get even more accurate parses, which are produced by the state-of-the-art neural parser from Mrini et al. (adapted to the GUM build bot by Nitin Venkateswaran).

Another addition in this version is the incorporation of constituent function labels following the original Penn Treebank phrase function labels, such as NP-SBJ for subject NPs, and subtypes of adverbials and PPs (ADVP-TMP, ADVP-MNR, PP-LOC, PP-DIR etc.). The GUM build bot now projects function labels onto the constituent trees, which are searchable in ANNIS as shown below.

Figure: Constituent trees enriched with function labels, such as SBJ, NOM, DIR.
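
For readers who want to work with such labels outside ANNIS, the following Python sketch splits a Penn Treebank style label into its phrase category and its function tags. The function name `split_label` is made up for this example. The handling of numeric co-indexation suffixes and of hyphen-initial tags follows general Penn Treebank conventions and is an assumption about what may appear in the GUM trees.

```python
def split_label(label):
    """Split a label like 'NP-SBJ' into ('NP', ['SBJ'])."""
    # Tags such as -NONE- or -LRB- start with a hyphen and are left intact.
    if label.startswith("-"):
        return label, []
    category, *rest = label.split("-")
    # Drop numeric co-indexation suffixes, such as the 1 in NP-SBJ-1.
    functions = [part for part in rest if not part.isdigit()]
    return category, functions


for label in ["NP-SBJ", "ADVP-TMP", "PP-LOC", "VP"]:
    print(label, split_label(label))
```

Keeping the functions as a list allows for labels that carry more than one function tag.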

Better consistency and new morphology

Since GUM6, we have been working on better convergence with existing standards, including revising RST segmentation to match the RST Discourse Treebank exactly, as well as convergence in POS tagging and dependency parsing with OntoNotes and the English Web Treebank.

As part of our efforts to match UD annotation conventions, we have completely reimplemented our morphological tagging pipeline, which now outputs all features expected by Universal Dependencies. This includes propagation of person, number and case information, even to verb forms with no overt agreement marking (VBD, VBP), which now indicate the person and number of their subject.

Figure: Morphological annotation grid, with annotations following the UD standard.
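
The sketch below shows one way subject person and number could be copied onto VBD and VBP verbs and then serialized in the UD FEATS style (Name=Value pairs joined by "|" and sorted by feature name). It is a simplified assumption about how such propagation might work, not the actual GUM pipeline. The token dictionaries, their keys, and the function names are hypothetical.

```python
def format_feats(feats):
    """Serialize a feature dict in the UD FEATS style, or '_' if it is empty."""
    if not feats:
        return "_"
    pairs = sorted(feats.items(), key=lambda kv: kv[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)


def propagate_subject_agreement(tokens):
    """Copy Person and Number from each nsubj to its VBD/VBP head verb."""
    by_id = {tok["id"]: tok for tok in tokens}
    for tok in tokens:
        if tok["deprel"] != "nsubj":
            continue
        head = by_id.get(tok["head"])
        if head is None or head["xpos"] not in {"VBD", "VBP"}:
            continue
        for name in ("Person", "Number"):
            if name in tok["feats"]:
                # Keep any value the verb already has.
                head["feats"].setdefault(name, tok["feats"][name])
    return tokens


# Hypothetical two-token sentence: "They married"
tokens = [
    {"id": 1, "form": "They", "xpos": "PRP", "head": 2, "deprel": "nsubj",
     "feats": {"Case": "Nom", "Number": "Plur", "Person": "3", "PronType": "Prs"}},
    {"id": 2, "form": "married", "xpos": "VBD", "head": 0, "deprel": "root",
     "feats": {"Mood": "Ind", "Tense": "Past", "VerbForm": "Fin"}},
]
propagate_subject_agreement(tokens)
print(format_feats(tokens[1]["feats"]))
# Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin
```

A real implementation would also need to handle cases this sketch ignores, such as coordinated subjects or subjects attached to a non-verbal predicate.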

Separate annotation of split antecedents

GUM7 now better separates split antecedent anaphora, illustrated below, using a special information status value, "split", as well as a dedicated edge type.

[Goode] has been in a relationship with [Sophie Dymoke] since 2005. [They] married in 2014.

(They split-refers to → Goode, Sophie Dymoke)
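
One simple way to picture this annotation is as a small graph in which the anaphor carries the "split" information status and has one typed edge to each of its antecedents. The Python sketch below uses that representation. The mention IDs, the edge type name `split_ante` and the "new" status on the antecedents are hypothetical choices for the example, not the names used in the GUM release.

```python
# Mentions keyed by a hypothetical ID. Only the "split" status is from the text.
mentions = {
    "m1": {"text": "Goode", "infstat": "new"},
    "m2": {"text": "Sophie Dymoke", "infstat": "new"},
    "m3": {"text": "They", "infstat": "split"},
}

# Each edge is (anaphor, antecedent, edge type). The type name is hypothetical.
edges = [
    ("m3", "m1", "split_ante"),
    ("m3", "m2", "split_ante"),
]


def split_antecedents(mention_id):
    """Return the texts of all split antecedents of the given mention."""
    return [
        mentions[target]["text"]
        for source, target, edge_type in edges
        if source == mention_id and edge_type == "split_ante"
    ]


print(split_antecedents("m3"))  # ['Goode', 'Sophie Dymoke']
```

Using a separate edge type keeps these links apart from ordinary coreference edges, so they can be included or excluded in a search as needed.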
