Overview

GUM's big brother

  • Latest stable version: 0.1
  • Latest commit: Automatic entity and coreference resolution in AMALGUM

AMALGUM is a machine-annotated multilayer corpus that follows the same design and annotation layers as GUM but is substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high-quality, richly annotated but small datasets and the larger but shallowly annotated corpora that are often scraped from the Web. In particular, we aim to make data available to support:

  • Pretraining on large-scale, silver-quality data before fine-tuning on smaller gold-standard datasets
  • Active learning to supplement training data and iteratively improve AMALGUM's own data
  • Better-than-out-of-the-box NLP quality, using every available trick, as both a tool and a target for NLP research

Composition

AMALGUM follows the same corpus design as GUM and currently contains the text types from the GUM version 6 series, with some different sources to allow for the larger scale:

Text type          Source             Documents     Tokens
Interviews         Wikinews                 778    500,090
News stories       Wikinews                 778    500,090
Travel guides      Wikivoyage               482    500,680
How-to guides      wikiHow                  613    500,014
Academic writing   MDPI                     662    500,285
Biographies        Wikipedia                600    500,760
Fiction            Project Gutenberg        457    500,088
Forum discussions  reddit                   682    500,412
Total                                     4,960  4,002,929
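
Each text type is held close to 500,000 tokens while the number of documents varies, so typical document length differs noticeably across text types. The following is a minimal sketch of that arithmetic, using only the figures from the table above. The names `COMPOSITION` and `mean_doc_length` are illustrative assumptions and are not part of any AMALGUM tooling.

```python
# Rows copied from the composition table above:
# (text type, source, documents, tokens).
# COMPOSITION and mean_doc_length are hypothetical names used for illustration.
COMPOSITION = [
    ("Interviews", "Wikinews", 778, 500_090),
    ("News stories", "Wikinews", 778, 500_090),
    ("Travel guides", "Wikivoyage", 482, 500_680),
    ("How-to guides", "wikiHow", 613, 500_014),
    ("Academic writing", "MDPI", 662, 500_285),
    ("Biographies", "Wikipedia", 600, 500_760),
    ("Fiction", "Project Gutenberg", 457, 500_088),
    ("Forum discussions", "reddit", 682, 500_412),
]


def mean_doc_length(rows):
    """Return a dict mapping each text type to its average tokens per document."""
    return {name: tokens / docs for name, _source, docs, tokens in rows}


if __name__ == "__main__":
    averages = mean_doc_length(COMPOSITION)
    # List text types from shortest to longest average document.
    for name, avg in sorted(averages.items(), key=lambda item: item[1]):
        print(f"{name:<18} {avg:8.1f} tokens/doc")
```

By these figures, fiction documents are the longest on average, which may matter when batching or truncating documents for training.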