How to build GUM and contribute corrections
If you notice errors in the GUM corpus, you can contribute corrections by forking the repository, editing the relevant source files and submitting a pull request. The GUM build bot script will propagate your changes to all other relevant corpus formats and merge them.
The build bot is also used to reconstruct reddit data, merging all annotations after plain text data has been restored using _build/process_reddit.py.
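For example, restoring the reddit token strings and then running a build might look like this (both commands are run from inside _build/; the same steps appear in the sample session at the end of this page):

> python process_reddit.py -m add
> python build_gum.py -p -c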
- only edit files in _build/src/
- edit POS, lemma and TEI tags in _build/src/xml/
- edit dependencies and functions in _build/src/dep/
- edit entities, information status and coreference in _build/src/tsv/
- edit RST in _build/src/rst/
- you can't edit constituent trees or CLAWS tags
- do not alter tokenization or sentence borders
GUM is distributed in a variety of formats which contain partially overlapping information. For example, almost all formats contain part of speech tags, which must be kept in sync across formats. This synchronization is handled by the GUM build script. As a result, it's important to know exactly where to correct what.
Overview
Of the many formats available in the GUM repo, only four are actually used to generate the dataset, with other formats being dynamically generated from these.
They are found under the directory _build/src/ in the sub-directories listed below (see the layout sketch after this list):
- xml/ - CWB vertical format XML for token annotations and flat TEI span annotations
- dep/ - dependency syntax in the 10-column conllx (a.k.a. conll10) format
- tsv/ - WebAnno 3 tab separated export format for entity, information status and coreference annotations
- rst/ - rhetorical structure theory analyses in the rs3 format as used by rstWeb
All other formats are generated from these files and cannot be edited directly (changes will be overwritten by the next build). References to source directories below (e.g. xml/) always refer to these sub-directories (i.e. _build/src/xml/).
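For orientation, the editable part of the source tree looks roughly like this (the document name and file extensions are illustrative; check the repository for the exact naming):

_build/src/
    xml/GUM_news_example.xml     - token strings, POS tags, lemmas and TEI spans
    dep/GUM_news_example.conll10 - dependency heads and functions
    tsv/GUM_news_example.tsv     - entities, information status and coreference
    rst/GUM_news_example.rs3     - rhetorical structure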
Committing your corrections to GitHub
Because multiple people can contribute corrections simultaneously, merging corrections is managed via GitHub.
To contribute corrections directly, you should always (see the example command sequence after this list):
- Fork the dev branch
- Edit, commit and push to your branch
- Make a pull request into the origin dev branch
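A minimal command-line sketch of this workflow (assuming your fork lives at YOUR_USERNAME/gum; adjust the branch name and commit message to your actual changes):

> git clone https://github.com/YOUR_USERNAME/gum.git
> cd gum
> git checkout dev
> git checkout -b my-corrections
...edit files under _build/src/...
> git commit -am "Correct POS tags in one xml/ document"
> git push origin my-corrections

Then open a pull request from your branch into the dev branch of the main repository on GitHub.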
Alternatively, if you have minor individual corrections, feel free to open an issue in our GitHub tracker and describe your change requests as accurately as possible.
Correcting token strings
Token strings come from the first column of the files in xml/. These should normally not be changed. Changing token strings in any other format has no effect (changes will be overwritten or lead to a conflict and crash).
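To illustrate, a few lines from a file in xml/ look roughly like this (token, extended PTB tag and lemma; columns are tab-separated in the real files, and the sentence here is invented for illustration):

<s type="decl">
The         DT     the
committee   NN     committee
has         VHZ    have
approved    VVN    approve
the         DT     the
plan        NN     plan
.           SENT   .
</s>

The first column is the token string; the second and third columns are where POS and lemma corrections go (see the next section).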
Correcting POS tags and lemmas
GUM contains lemmas and three types of POS tags for every token:
- 'Vanilla' PTB tags following Santorini (1990)
- Extended PTB tags as used by TreeTagger (Schmid 1994)
- CLAWS5 tags as used in the BNC
You can correct lemmas and extended PTB tags in the xml/ directory. Vanilla PTB tags are produced fully automatically from the extended tags and should not be corrected. Correct the extended tags instead.
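To give a sense of the mapping, the derivation mostly collapses the extended tagset's distinctions for 'have' and lexical verbs back into the vanilla verb tags. A simplified sketch in Python (this is not the build script's actual code, just the general idea):

# Simplified illustration of how vanilla PTB tags relate to extended tags.
def vanilla_tag(extended_tag):
    # The extended tagset distinguishes lexical verbs (VV*) and 'have' (VH*);
    # vanilla PTB uses VB* for all verbs
    if extended_tag.startswith(("VV", "VH")):
        return "VB" + extended_tag[2:]
    # SENT marks sentence-final punctuation; vanilla PTB uses '.'
    if extended_tag == "SENT":
        return "."
    return extended_tag

print(vanilla_tag("VVN"))  # VBN
print(vanilla_tag("VHZ"))  # VBZ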
CLAWS tags are produced by an automatic tagger, but are post-processed to remove errors based on the gold extended PTB tags. As a result, most CLAWS errors can be corrected by correcting the PTB tags. Direct corrections to CLAWS tags are likely to be destroyed by the build script. If you find a CLAWS error despite a correct PTB tag, please let us know so we can improve post-processing.
Correcting TEI tags in xml/
The XML tags in the xml/ directory are based on the TEI vocabulary. Although the schema for GUM is much simpler than TEI, some nesting restrictions as well as naming conventions apply.
Corrections to XML tags can be submitted; however, please make sure that the corrected file validates against the XSD schema in the _build directory. Corrections that don't validate will fail to merge.
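One quick way to check this locally before committing is to validate with lxml (a sketch; the schema and document filenames below are placeholders, so point them at the actual XSD file in _build and the file you edited):

# Sketch: validate a corrected xml/ file against the XSD schema using lxml.
from lxml import etree

schema = etree.XMLSchema(etree.parse("_build/gum_schema.xsd"))  # placeholder filename
doc = etree.parse("_build/src/xml/GUM_news_example.xml")        # placeholder filename

if schema.validate(doc):
    print("OK: document validates")
else:
    for error in schema.error_log:
        print(error)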
If you feel the schema should be updated to accommodate some correction, please let us know.
Dependencies
Dependency information in the dep/ directory can be corrected directly there (a sample excerpt follows the list below). However, note that:
- Only dependency head and function originate in these files
- POS tags and lemmas, as well as sentence type and speaker information come from the xml/ files
- You can't alter tokenization or sentence break information (see below)
- You can't alter Universal Dependencies data in dep/ud/, since it is automatically generated from the Stanford Dependencies. If changes to Stanford Dependencies do not propagate as expected to UD data, please contact us.
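A schematic excerpt from a dep/ file, with the ten conllx columns ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL and two unused columns (columns are tab-separated in the real files; the sentence and tags are invented for illustration, and only the HEAD and DEPREL columns are authoritative here):

1   The         the        DT   DT   _   2   det     _   _
2   committee   committee  NN   NN   _   3   nsubj   _   _
3   approved    approve    VBD  VBD  _   0   root    _   _
4   the         the        DT   DT   _   5   det     _   _
5   plan        plan       NN   NN   _   3   dobj    _   _
6   .           .          .    .    _   3   punct   _   _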
const/ - Constituent trees
Constituent trees in const/ are generated automatically based on the tokenization, POS tags and sentence breaks from the XML files, and cannot be corrected manually at present. Note that token-less data for reddit documents is included in the release under target/const/ for convenience. This data can be used to restore reddit constituent parses using _build/process_reddit.py without having to re-run the Stanford Parser.
coref/ - Coreference and entities
Coreference and entity annotations are available in several formats, but all information is projected from the tsv/ directory.
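The WebAnno TSV files begin with #FORMAT and layer declaration headers, followed by one line per token giving sentence-token IDs, character offsets, the token string and the annotation columns defined by those layers; coreference edges point back to the antecedent's token ID in the relation columns. A heavily simplified, hypothetical excerpt (the layer names, entity types and column layout here are illustrative only, and columns are tab-separated in the real files; consult an existing file in tsv/ for the real header):

#FORMAT=WebAnno TSV 3
#T_SP=webanno.custom.Referent|entity|infstat
#T_RL=webanno.custom.Coref|type|BT_webanno.custom.Referent

#Text=Kim approved the plan .
1-1   0-3     Kim        person      new      _   _
1-2   4-12    approved   _           _        _   _
1-3   13-16   the        object[1]   new[1]   _   _
1-4   17-21   plan       object[1]   new[1]   _   _
1-5   22-23   .          _           _        _   _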
rst/ - Rhetorical Structure Theory
The rst/ directory contains Rhetorical Structure Theory analyses in the rs3 format (a minimal example follows the list below). You can edit the rhetorical relations and even make finer-grained segments, but:
- You cannot edit the tokenization, which is expressed by spaces inside each segment, but ultimately generated from the XML files from xml/
- By convention, you are not allowed to make segments that contain multiple sentences according to the <s> elements in the XML files
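A minimal, hypothetical rs3 fragment, just to show where relations and segments live (the relation name and segment text are invented; note that token boundaries inside each segment are expressed by spaces):

<rst>
  <header>
    <relations>
      <rel name="elaboration" type="rst"/>
    </relations>
  </header>
  <body>
    <segment id="1" parent="3" relname="span">The committee approved the plan ,</segment>
    <segment id="2" parent="1" relname="elaboration">which had been revised twice .</segment>
    <group id="3" type="span"/>
  </body>
</rst>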
Running the build script
The build script in _build is run like this:
> python build_gum.py [-s SOURCE_DIR] [-t TARGET_DIR] [OPTIONS]
Source and target directories default to _build/src and _build/target if not supplied. Re-parsing and re-tagging CLAWS tags are optional steps, which can be skipped if those data sources are already available and no POS tag, sentence border or token string changes have occurred. See below for more option settings.
The build script runs in three stages:
- Validation:
- check that all source directories have the same number of files
- check that document names match across directories
- check that token count matches across formats
- check that sentence borders match across formats (for xml/ and dep/; the tsv/ sentence borders are adjusted below)
- validate XML files using XSD schema
- Propagation:
- project token forms from xml/ to all formats
- project POS tags and lemmas from xml/ to dep/
- project sentence types and speaker information from xml/ to dep/
- adjust sentence borders in tsv/
- generate vanilla PTB tags from extended tags
- (optional) rerun CLAWS tagging and correct based on PTB tags and dependencies (requires TreeTagger)
- (optional) re-run constituent parser based on updated POS tags (requires Stanford parser)
- Convert and merge:
- generate conll coref format
- update version metadata
- merge all data sources using SaltNPepper
- output merged versions of corpus in PAULA and ANNIS formats
Options and sample output
Beyond setting different source and target directories, some flags specify optional behaviors:
- -p - Re-parse the data using the Stanford Parser based on current tokens, sentence borders and POS tags (requires Stanford Parser and correct path settings in paths.py; see the sketch after this list)
- -c - Re-tag CLAWS5 tags (requires TreeTagger configured in path settings in paths.py, and the supplied CLAWS training model in utils/treetagger/lib/)
- -v - Verbose Pepper output - useful for debugging the merge step on errors
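The external tool locations for these options are configured in _build/paths.py. The variable names in the following sketch are purely illustrative (check the actual paths.py for the real names); the point is only that the tool paths live there:

# Hypothetical sketch of _build/paths.py - variable names are illustrative only.
# The real file configures where build_gum.py finds its external tools.
stanford_parser_path = "/opt/stanford-parser/"   # needed for the -p option
treetagger_path = "/opt/treetagger/"             # needed for the -c option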
A successful run, including recovering reddit data, with all options on, should look like this:
amir@GITM _build
> python process_reddit.py -m add
o Processing 6 files in src/xml/...
o Processing 6 files in src/tsv/...
o Processing 6 files in src/dep/...
o Processing 6 files in src/rst/...
o Processing 6 files in target/const/...
Completed fetching reddit data.
You can now use build_gum.py to produce all annotation layers.
amir@GITM _build
> python build_gum.py -c -p -u
====================
Validating files...
====================
Found reddit source data
Including reddit data in build
o Found 101 documents
o File names match
o Token counts match across directories
o 101 documents pass XSD validation
Enriching Dependencies:
=======================
o Enriched dependencies in 101 documents
Enriching XML files:
=======================
o Retrieved fresh CLAWS5 tags
o Enriched xml in 101 documents
Adjusting token and sentence borders:
========================================
o Adjusted 101 WebAnno TSV files
o Adjusted 101 RST files
Regenerating constituent trees:
==============================
o Reparsed 101 documents
Creating Universal Dependencies version:
========================================
o Converted 101 documents to Universal Dependencies
Starting pepper conversion:
==============================
i Pepper reports 201 empty xml spans were ignored
i Pepper says:
Conversion ended successfully, required time: 00:01:05.602 s
(In case of errors you can get verbose pepper output using the -v flag)