Tagging and lemmatization


The extended POS tagset used in GUM:

Do and Don't

Tagging “don't” (really two tokens, “do” and “n't”): the verb “do” is not treated as an auxiliary in the PTB scheme, in the sense that it has no special tag. In a present-tense form like “I don't do X”, the first “do” is VVP and the second “do” is VV (a base form); in an imperative like “Don't go!”, both verbs are VV (the imperative is not considered present). The negation “not” and its contracted form “n't” are considered adverbial (compare “very good” vs. “not good”: both modifiers are adverbs), so both are tagged RB.

Proper Nouns and Titles

Titles of books, films, etc.: tokens are considered NP or NPS if they are capitalized, but function words are tagged as normal. So for “Starship Troopers”, both words are considered ‘proper’ and tagged Starship = NP and Troopers = NPS. But for “Beauty and the Beast” we get: NP, CC, DT, NP.
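The title rule can be sketched in Python as follows. This is only an illustration, assuming each title token already carries an ordinary (non-title) tag from some baseline tagger; the name `tag_title` and the set `FUNCTION_TAGS` are hypothetical and not part of any GUM tooling.

```python
# Hypothetical helper, not part of GUM's tooling.
# Assumption: these tags mark function words, which keep their ordinary tag.
FUNCTION_TAGS = {"CC", "DT", "IN", "TO"}

def tag_title(tokens):
    """Retag (word, ordinary_tag) pairs that together form a title."""
    result = []
    for word, tag in tokens:
        if tag in FUNCTION_TAGS or not word[:1].isupper():
            result.append((word, tag))    # function word or lowercase: tagged as normal
        elif tag in {"NNS", "NPS"}:
            result.append((word, "NPS"))  # capitalized plural
        else:
            result.append((word, "NP"))   # any other capitalized token
    return result

print(tag_title([("Starship", "NN"), ("Troopers", "NNS")]))
print(tag_title([("Beauty", "NN"), ("and", "CC"), ("the", "DT"), ("Beast", "NN")]))
```

The second call yields the tag sequence NP, CC, DT, NP, matching the example above.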

Comparatives with more / less

In cases like “more interesting”, we have two tokens - ‘more’ itself is tagged JJR, but 'interesting' is still just a normal JJ. If you're counting comparatives in the corpus, counting JJR still gets you

Number Ranges

When a hyphen or dash appears in a number or date range, it means (and would be pronounced as) 'to', and is therefore tagged TO.

  • August 2 , 1754 –/TO June 14 , 1825
  • twenty -/TO thirty minutes
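When checking a tagged file for range dashes, a rough heuristic is to list every dash token that directly follows a number and review those by hand. The sketch below assumes tokens are (word, tag) pairs; the function name, the set of dash characters, and the tags shown for the non-dash tokens are illustrative assumptions rather than anything prescribed by the guidelines.

```python
# Hypothetical review helper; a heuristic only, an annotator still decides.
DASHES = {"-", "\u2013", "\u2014"}  # hyphen, en dash, em dash

def range_dash_candidates(tokens):
    """Return indices of dash tokens that directly follow a CD token."""
    return [
        i for i, (word, tag) in enumerate(tokens)
        if word in DASHES and i > 0 and tokens[i - 1][1] == "CD"
    ]

# Tags other than TO are assumed here for the sake of the example.
date_range = [("August", "NP"), ("2", "CD"), (",", ","), ("1754", "CD"),
              ("\u2013", "TO"), ("June", "NP"), ("14", "CD"), (",", ","), ("1825", "CD")]
print(range_dash_candidates(date_range))  # [4]
```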


When an error (annotated with sic in markup) is a grammatically plausible construction, tag the word as it is found in the text, rather than what it “should” be:

  • I <sic>known/VVN</sic> it

Misspellings are tagged as if they were correctly spelled, even if the misspelling has the form of a different word. For example, if 'too' appears in a construction where only 'to' would be grammatically conceivable, it is considered a misspelling of 'to' and tagged accordingly:

  • I want <sic>too/TO</sic> go

A disfluent token is tagged based on what you think it would have been had it not been disfluent:

  • The temperature of this w-/NN water/NN here

If it's hard to know with reasonable certainty what a disfluent token was “supposed” to have been, consider using UH or SYM.

CD vs. PP

The generic pronoun “one” is tagged PP, not CD:

  • One/PP wonders!

However, some uses of “one” in reference to people are still CD:

  • They entered one/CD by one/CD
  • One/CD of the guards

WP vs. WDT

Use WDT for “which”, and also for “that” when it is used as a relative pronoun:

  • … causing symptoms that/WDT show up years later …
  • The plant , which/WDT is owned by Hollingsworth …

And use WP for what, who, and whom:

  • What/WP this tells us is that U.S. trade law is working …
  • Mr. Cray , who/WP could n't be reached for comment …
  • It's the petulant complaint of an impudent American whom/WP Sony hosted for a year while he was on a Luce Fellowship in Tokyo …
  • I'll get you whatever/WP you want.

But use WDT and not WP when the wh-word (e.g. “what” or “whichever”) is modifying a noun:

  • … no one will check to determine what/WDT notes/NNS a person has taken .
  • I usually buy whichever/WDT brand/NN of coffee is on sale .

Cf. PTB guidelines for more details.
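The WP vs. WDT decisions above can be summarized as a small lookup. This is a sketch only: the function `wh_tag` and its two boolean parameters are hypothetical, and it assumes the annotator already knows whether the word modifies a noun and whether “that” is a relative pronoun. Cases the rules above do not mention return `None`.

```python
def wh_tag(word, modifies_noun=False, relative_that=False):
    """Suggest WDT or WP for a wh-word, or None if the rules above do not cover it."""
    w = word.lower()
    if w == "that":
        # Only relative-pronoun "that" is covered by this section.
        return "WDT" if relative_that else None
    if w == "which":
        return "WDT"
    if w in {"what", "whichever"} and modifies_noun:
        return "WDT"
    if w in {"who", "whom"} or (w in {"what", "whatever"} and not modifies_noun):
        return "WP"
    return None

print(wh_tag("What"))                           # WP
print(wh_tag("what", modifies_noun=True))       # WDT
print(wh_tag("whichever", modifies_noun=True))  # WDT
```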


No verb should have -ing in its lemma. However, nouns ending with -ing should keep the -ing in their lemma. Some words can be both nouns and verbs; categorize them based on the specific instance.

  • writing/VVG beautifully has the lemma write
  • beautiful writing/NN has the lemma writing
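A consistency check for this rule might look like the sketch below. It assumes rows of (form, tag, lemma) and treats any tag that starts with V and ends in G as an -ing verb form (only VVG appears in this document, so the broader pattern is an assumption). It compares the lemma to the surface form instead of testing whether the lemma ends in -ing, so that verbs such as “sing” are not flagged. The function name is hypothetical.

```python
def ing_lemma_problems(rows):
    """Flag -ing verb tokens whose lemma is still identical to the surface form."""
    problems = []
    for form, tag, lemma in rows:
        is_ing_verb = tag.startswith("V") and tag.endswith("G")
        if is_ing_verb and form.lower().endswith("ing") and lemma.lower() == form.lower():
            problems.append((form, tag, lemma))
    return problems

rows = [
    ("writing", "VVG", "write"),    # verb: lemma drops -ing
    ("writing", "NN", "writing"),   # noun: lemma keeps -ing
    ("singing", "VVG", "sing"),     # fine: lemma differs from the form
    ("writing", "VVG", "writing"),  # inconsistent: verb lemma kept -ing
]
print(ing_lemma_problems(rows))  # [('writing', 'VVG', 'writing')]
```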

Non-standard forms

  • The pronoun "em" (e.g. we saw 'em) is tagged PP and lemmatized "they"
  • "kinda", when used to mean "approximately", is tagged RB and lemmatized "kinda"
  • "gonna", "wanna" are tokenized gon + na, wan + na and tagged using standard forms, e.g. gon/VVG/go na/TO/to
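The tokenization of “gonna” and “wanna” can be illustrated with a short sketch. It covers only the split itself; tags and lemmas are then assigned as for the standard forms. The names `CONTRACTION_SPLITS` and `split_contractions` are hypothetical.

```python
# Splits taken from the guideline above: gon + na, wan + na.
CONTRACTION_SPLITS = {"gonna": ("gon", "na"), "wanna": ("wan", "na")}

def split_contractions(words):
    """Split 'gonna' and 'wanna' into two tokens, keeping the original casing."""
    out = []
    for word in words:
        parts = CONTRACTION_SPLITS.get(word.lower())
        if parts is None:
            out.append(word)
        else:
            cut = len(parts[0])  # slice the surface form so casing is preserved
            out.extend([word[:cut], word[cut:]])
    return out

print(split_contractions(["We", "'re", "gonna", "win"]))
# ['We', "'re", 'gon', 'na', 'win']
```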

Interrupted words

If an interrupted word can be obviously reconstructed, it is given its normal POS tag:

  • I wanted to distrac-/VV

If the reconstruction is uncertain, the tag UH is used:

  • We chose a d-/UH (could be JJ, NN, NP, something else …)

Lexicalized words

If a multi-word construction has been lexicalized into one word (e.g. rapidly-growing rather than rapidly growing), then it must be treated as a lexicalized adjective or noun rather than a verb. Most often, these become JJs, such as

  • a rapidly-growing/JJ plant

Lexicalized nouns exist too, like

  • the constant egg-laying/NN

The lemmas of these words keep the gerund, i.e. egg-laying and not *egg-lay.


URLs should be tagged as proper nouns (NP), since a URL is effectively the name of a ‘place’.
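As a rough illustration, URL tokens could be retagged automatically along these lines. The pattern for what counts as a URL (a token starting with `http://`, `https://` or `www.`) is an assumption made for this sketch, and the name `tag_urls` is hypothetical.

```python
import re

# Assumption: a token counts as a URL if it starts with a scheme or "www.".
URL_PATTERN = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)

def tag_urls(tokens):
    """Set the tag to NP for URL-like tokens; leave all other pairs unchanged."""
    return [(word, "NP" if URL_PATTERN.match(word) else tag) for word, tag in tokens]

print(tag_urls([("Visit", "VV"), ("https://example.org", "NN")]))
# [('Visit', 'VV'), ('https://example.org', 'NP')]
```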

gum/tagging.txt · Last modified: 2021/09/27 19:27 by sp1184