How to build GUM and contribute corrections

If you notice errors in the GUM corpus, you can contribute corrections by forking the repository, editing the relevant source files and submitting a pull request. The GUM build bot script will then propagate your changes to all other relevant corpus formats and merge them.

The build bot is also used to reconstruct reddit data, merging all annotations after plain text data has been restored using _build/

Executive Summary (TL;DR)

Where to correct what

GUM is distributed in a variety of formats which contain partially overlapping information. For example, almost all formats contain part of speech tags, which need to be in sync across formats. This synchronization is performed by the GUM Build Script. As a result, it's important to know exactly where to correct what.


Of the many formats available in the GUM repo, only four are actually used to generate the dataset, with other formats being dynamically generated from these. They are found under the directory _build/src/ in the sub-directories:

  • xml/ - token strings, lemmas, extended PTB tags and TEI markup
  • dep/ - dependency syntax
  • tsv/ - coreference and entity annotations
  • rst/ - Rhetorical Structure Theory analyses

All other formats are generated from these files and cannot be edited directly (any changes will be overwritten on the next build). References to source directories below (e.g. xml/) always refer to these sub-directories (_build/src/xml/).

Committing your corrections to GitHub

Because multiple people can contribute corrections simultaneously, merging corrections is managed via GitHub. To contribute corrections directly, you should always:

Alternatively, if you have minor individual corrections, feel free to open an issue in our GitHub tracker and describe your change requests as accurately as possible.

Correcting token strings

Token strings come from the first column of the files in xml/. These should normally not be changed. Changing token strings in any other format has no effect (such changes will either be overwritten or cause a conflict and crash the build).
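
For orientation, here is a minimal sketch of reading that first column, assuming the xml/ files hold one token per line in tab-separated columns (token first, followed by tag and lemma) interleaved with TEI tag lines; the exact column inventory is an assumption for illustration:

```python
import io

def token_strings(lines):
    """Yield the token string (first tab-separated column) from each
    non-markup line of a GUM _build/src/xml/ file."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("<"):  # skip TEI markup lines
            continue
        yield line.split("\t")[0]

sample = io.StringIO(
    '<s type="decl">\n'
    "The\tDT\tthe\n"
    "dog\tNN\tdog\n"
    "barks\tVBZ\tbark\n"
    "</s>\n"
)
print(list(token_strings(sample)))  # → ['The', 'dog', 'barks']
```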

Correcting POS tags and lemmas

GUM contains lemmas and three types of POS tags for every token:

You can correct lemmas and extended PTB tags in the xml/ directory. Vanilla PTB tags are produced fully automatically from the extended tags and should not be corrected. Correct the extended tags instead. CLAWS tags are produced by an automatic tagger, but are post-processed to remove errors based on the gold extended PTB tags. As a result, most CLAWS errors can be corrected by correcting the PTB tags. Direct corrections to CLAWS tags are likely to be destroyed by the build script. If you find a CLAWS error despite a correct PTB tag, please let us know so we can improve post-processing.
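
To illustrate why vanilla tags need no manual correction, here is a hypothetical sketch of a deterministic extended-to-vanilla mapping. The tag pairs shown (TreeTagger-style VV*/VH* and NP/NPS versus PTB VB* and NNP/NNPS) are assumptions for illustration only, not the build script's actual rules:

```python
import re

def vanilla_ptb(extended):
    """Hypothetical illustration: collapse an extended tag to a
    vanilla PTB tag deterministically (the real mapping lives in
    the build script)."""
    # Lexical/auxiliary verb tags VV*, VH* collapse to PTB VB*
    m = re.match(r"^(VV|VH)(.*)$", extended)
    if m:
        return "VB" + m.group(2)
    # Proper noun tags NP/NPS correspond to PTB NNP/NNPS
    if extended in ("NP", "NPS"):
        return "N" + extended
    # Sentence-final punctuation tag
    if extended == "SENT":
        return "."
    return extended  # already a vanilla tag

print(vanilla_ptb("VVZ"))  # VBZ
print(vanilla_ptb("NPS"))  # NNPS
```

Because the mapping is a pure function of the extended tag, any correction you make to the extended tag propagates to the vanilla tag automatically on the next build.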

Correcting TEI tags in xml/

The XML tags in the xml/ directory are based on the TEI vocabulary. Although the schema for GUM is much simpler than TEI, some nesting restrictions as well as naming conventions apply. Corrections to XML tags can be submitted; however, please make sure that the corrected file validates against the XSD schema in the _build directory. Corrections that don't validate will fail to merge. If you feel the schema should be updated to accommodate some correction, please let us know.
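
Pre-checking your edited file locally can save a round trip on the pull request. A minimal well-formedness check with the standard library is sketched below; full XSD validation additionally requires an XSD-capable library (such as lxml) pointed at the schema in _build:

```python
import xml.etree.ElementTree as ET

def well_formed(xml_text):
    """Quick pre-check before submitting: is the file well-formed XML?
    (This does not replace validation against the XSD schema.)"""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = "<text><s>ok</s></text>"
bad = "<text><s>unclosed</text>"
print(well_formed(good), well_formed(bad))  # True False
```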


dep/ - Dependencies

Dependency information in the dep/ directory can be corrected directly there. However, note that:

const/ - Constituent trees

Constituent trees in const/ are generated automatically based on the tokenization, POS tags and sentence breaks from the XML files, and cannot be corrected manually at present. Note that token-less data for reddit documents is included in the release under target/const/ for convenience. This data can be used to restore reddit constituent parses using _build/ without having to re-run the Stanford Parser.

coref/ - Coreference and entities

Coreference and entity annotations are available in several formats, but all information is projected from the tsv/ directory.
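
As a rough sketch of what WebAnno TSV token lines look like (an assumption for illustration: the first three tab-separated fields are a sentence-token index, character offsets and the token form, with annotation columns following, while header and text lines start with #):

```python
def tsv_tokens(lines):
    """Collect (id, form) pairs from WebAnno TSV token lines,
    skipping comment/header lines beginning with '#'. The column
    layout beyond the first three fields varies per file."""
    toks = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        toks.append((fields[0], fields[2]))  # sent-tok id, token form
    return toks

sample = [
    "#FORMAT=WebAnno TSV 3\n",
    "#Text=The dog barked.\n",
    "1-1\t0-3\tThe\t_\n",
    "1-2\t4-7\tdog\t_\n",
]
print(tsv_tokens(sample))  # [('1-1', 'The'), ('1-2', 'dog')]
```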

rst/ - Rhetorical Structure Theory

The rst/ directory contains Rhetorical Structure Theory analyses in the rs3 format. You can edit the rhetorical relations and even create finer-grained segments, but:
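
Since rs3 is an XML format, segments and their relation labels can be inspected with a few lines of standard-library parsing. The fragment below is a hypothetical, minimal rs3 file for illustration, not a real GUM document:

```python
import xml.etree.ElementTree as ET

# Minimal, hypothetical rs3 fragment; real GUM files declare their
# full relation inventory in the <header> element.
rs3 = """<rst>
  <header>
    <relations>
      <rel name="elaboration" type="rst"/>
    </relations>
  </header>
  <body>
    <segment id="1">The dog barked.</segment>
    <segment id="2" parent="1" relname="elaboration">It was loud.</segment>
  </body>
</rst>"""

tree = ET.fromstring(rs3)
segments = tree.findall(".//segment")
print(len(segments))                          # 2
print([s.get("relname") for s in segments])   # [None, 'elaboration']
```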

Running the build script


The build script in _build is run like this:

> python [-s SOURCE_DIR] [-t TARGET_DIR] [OPTIONS]

Source and target directories default to _build/src and _build/target if not supplied. Parsing and CLAWS re-tagging are optional if those data sources are already available and no POS tag, sentence border or token string changes have occurred. See below for more option settings.

The build script runs in three stages:

  1. Validation:
    • check that all source directories have the same number of files
    • check that document names match across directories
    • check that token count matches across formats
    • check that sentence borders match across formats (for xml/ and dep/; the tsv/ sentence borders are adjusted below)
    • validate XML files using XSD schema
  2. Propagation:
    • project token forms from xml/ to all formats
    • project POS tags and lemmas from xml/ to dep/
    • project sentence types and speaker information from xml/ to dep/
    • adjust sentence borders in tsv/
    • generate vanilla PTB tags from extended tags
    • (optional) rerun CLAWS tagging and correct based on PTB tags and dependencies (requires TreeTagger)
    • (optional) re-run constituent parser based on updated POS tags (requires Stanford parser)
  3. Convert and merge:
    • generate conll coref format
    • update version metadata
    • merge all data sources using SaltNPepper
    • output merged versions of corpus in PAULA and ANNIS formats
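
The first validation checks above can be sketched as follows. The logic (compare file counts, then document names, across the source directories) is an assumed reconstruction of the build bot's behavior, and the file extensions used in the demo are illustrative:

```python
import os
import tempfile

def check_source_dirs(src_dirs):
    """Sketch of the build bot's first validation steps: every source
    directory must contain the same number of files, and the document
    names (extensions stripped) must match across directories."""
    names = {d: sorted(os.path.splitext(f)[0] for f in os.listdir(d))
             for d in src_dirs}
    counts = {d: len(n) for d, n in names.items()}
    if len(set(counts.values())) != 1:
        raise ValueError("file counts differ: %s" % counts)
    if len({tuple(n) for n in names.values()}) != 1:
        raise ValueError("document names differ across directories")
    return counts

# Demo on temporary stand-ins for the xml/, dep/, tsv/ and rst/ dirs
root = tempfile.mkdtemp()
for sub, ext in [("xml", ".xml"), ("dep", ".conllu"),
                 ("tsv", ".tsv"), ("rst", ".rs3")]:
    d = os.path.join(root, sub)
    os.mkdir(d)
    for doc in ("GUM_voyage_athens", "GUM_news_iodine"):
        open(os.path.join(d, doc + ext), "w").close()

counts = check_source_dirs([os.path.join(root, s)
                            for s in ("xml", "dep", "tsv", "rst")])
print(counts)  # per-directory file counts (2 documents each)
```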

Options and sample output

Beyond setting different source and target directories, some flags specify optional behaviors:

A successful run including recovery of reddit data, with all options on, should look like this:

> python -m add
o Processing 6 files in src/xml/...
o Processing 6 files in src/tsv/...
o Processing 6 files in src/dep/...
o Processing 6 files in src/rst/...
o Processing 6 files in target/const/...
Completed fetching reddit data.
You can now run the build script to produce all annotation layers.

> python -c -p -u
Validating files...

Found reddit source data
Including reddit data in build
o Found 101 documents
o File names match
o Token counts match across directories
o 101 documents pass XSD validation

Enriching Dependencies:
o Enriched dependencies in 101 documents

Enriching XML files:
o Retrieved fresh CLAWS5 tags
o Enriched xml in 101 documents

Adjusting token and sentence borders:
o Adjusted 101 WebAnno TSV files
o Adjusted 101 RST files

Regenerating constituent trees:
o Reparsed 101 documents

Creating Universal Dependencies version:
o Converted 101 documents to Universal Dependencies

Starting pepper conversion:

i Pepper reports 201 empty xml spans were ignored
i Pepper says:

Conversion ended successfully, required time: 00:01:05.602 s

(In case of errors you can get verbose pepper output using the -v flag)