If you notice some errors in the GUM corpus, you can contribute corrections by forking the repository, editing specific files and submitting a pull request. The GUM build bot script will propagate changes to other relevant corpus formats and merge the changes
GUM is distributed in a variety of formats which contain partially overlapping information. For example, almost all formats contain part of speech tags, which need to be in sync across formats. This synchronization is created by the GUM Build Script. As a result, it's important to know exactly where to correct what.
Of the many formats available in the GUM repo, only four are actually used to generate the dataset, with other formats being dynamically generated from these. They are found under the directory _build/src/ in the sub-directories:
All other formats are generated from these files and there is no possibility to edit them (changes will be quashed on the next build process). References to source directories below (e.g. xml/) always refer to these sub-directories (_build/source/xml/).
Because multiple people can contribute corrections simultaneously, merging corrections is managed over github. To contibute corrections directly, you should always:
Alternatively, if you have minor individual corrections, feel free to open an issue in our Github tracker and describe your change requests as accurately as possible.
Token strings come from the first column of the files in xml/. These should normally not be changed. Changing token strings in any other format has no effect (changes will be overwritten or lead to a conflict and crash).
GUM contains lemmas and three types of POS tags for every token:
You can correct lemmas and extended PTB tags in the xml/ directory. Vanilla PTB tags are produced fully automatically from the extended tags and should not be corrected. Correct the extended tags instead. CLAWS tags are produced by an automatic tagger, but are post-processed to remove errors based on the gold extended PTB tags. As a result, most CLAWS errors can be corrected by correcting the PTB tags. Direct corrections to CLAWS tags are likely to be destroyed by the build script. If you find a CLAWS error despite a correct PTB tag, please let us know so we can improve post-processing.
The XML tags in the xml/ directory are based on the TEI vocabulary. Although the schema for GUM is much simpler than TEI, some nesting restrictions as well as naming conventions apply. Corrections to XML tags can be submitted, however please make sure that the corrected file validates using the XSD schema in the _build directory. Corrections that don't validate will fail to merge. If you feel the schema should be updated to accommodate some correction, please let us know.
Dependency information in the dep/ directory can be corrected directly there. However note that:
Constituent trees in const/ are generated automatically based on the tokenization, POS tags and sentence breaks from the XML files, and cannot be corrected manually at present. We plan to project correction information to them from the dependency trees soon.
Coreference and entity annotations are available in several formats, but all information is projected from the tsv/ directory.
The rst/ directory contains Rhetorical Structure Theory analyses in the rs3 format. You can edit the rhetorical relations and even make finer grained segments, but:
The build script in _build is run like this:
> python build_gum.py [-s SOURCE_DIR] [-t TARGET_DIR] [OPTIONS]
Source and target directories default to _build/src and _build/target if not supplied. Parsing and re-tagging CLAWS tags are optional if those data sources are already available and no POS tag, sentence borders or token string changes have occurred. See below for more option settings.
The build script runs in three stages:
Beyond setting different source and target directories, some flags specify optional behaviors:
A successful run with all options on should look like this: