Overview

GUM's strange cousin

  • Latest stable version: 0.1
RST tree fragment from a theorem proof in GENTLE

GENTLE is a manually annotated multilayer corpus that follows the same design and annotation layers as GUM but covers unusual text types. The goal of this corpus is to provide a test set of challenging genres on which NLP systems can be evaluated. In particular, we aim to make data available to support:

  • Evaluating NLP models trained on homogeneous data (or even multi-genre data, such as GUM) to find out how much they degrade on out-of-domain data.
  • Describing and understanding unusual text types in linguistic terms, and comparing them with more familiar genres (e.g. mathematical proofs are not closely similar to any other genre, while poetry, which seems highly unconventional, is most similar to GUM's fiction genre).

Composition

GENTLE follows the same corpus design as GUM and serves as an extension to it by adding 8 unusual genres:

Text type             Source       Docs  Tokens
Dictionary entries    Wiktionary      3   2,423
Esports commentaries  YouTube         2   2,149
Legal documents       Wikisource      2   2,288
Medical notes         MTSamples       4   2,164
Poetry                Wikisource      5   2,090
Mathematical proofs   ProofWiki       3   2,106
Syllabuses            GitHub          2   2,431
Threat letters        casetext        5   2,146
Total                                26  17,797
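
As a quick sanity check on the composition figures, the following sketch re-derives the totals and computes the mean document length per text type. The counts are copied from the table above; the names `GENTLE_COMPOSITION` and `summarize` are illustrative only and are not part of any GENTLE tooling.

```python
# Composition of GENTLE as listed in the table above:
# (text type, source, number of documents, number of tokens).
GENTLE_COMPOSITION = [
    ("Dictionary entries", "Wiktionary", 3, 2423),
    ("Esports commentaries", "YouTube", 2, 2149),
    ("Legal documents", "Wikisource", 2, 2288),
    ("Medical notes", "MTSamples", 4, 2164),
    ("Poetry", "Wikisource", 5, 2090),
    ("Mathematical proofs", "ProofWiki", 3, 2106),
    ("Syllabuses", "GitHub", 2, 2431),
    ("Threat letters", "casetext", 5, 2146),
]


def summarize(rows):
    """Return total docs, total tokens, and mean tokens per document by text type."""
    total_docs = sum(docs for _, _, docs, _ in rows)
    total_tokens = sum(tokens for _, _, _, tokens in rows)
    mean_per_doc = {name: tokens / docs for name, _, docs, tokens in rows}
    return total_docs, total_tokens, mean_per_doc


if __name__ == "__main__":
    total_docs, total_tokens, mean_per_doc = summarize(GENTLE_COMPOSITION)
    print(f"Total: {total_docs} documents, {total_tokens} tokens")
    for name, mean in mean_per_doc.items():
        print(f"{name}: {mean:.1f} tokens per document")
```

Each text type contributes roughly 2,100 to 2,400 tokens, so the mean document length varies with the number of documents sampled per genre.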