Tagging and lemmatization


Do and Don't

Tagging “don't” (really two tokens, “do” and “n't”): the verb ‘do’ is not treated as an auxiliary in the PTB scheme, in the sense that it has no special tag. In a present-tense clause like “I don't do X”, the first ‘do’ is VVP and the second ‘do’ is VV (a base form); in an imperative like “Don't go!”, both verbs are VV (the imperative is not considered present). The negation ‘not’ and its clitic form ‘n't’ are considered adverbial (compare ‘very good’ vs. ‘not good’: both modifiers are adverbs), so they are tagged RB.
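The rule above can be sketched in a few lines of Python. This is only an illustration, not part of any tagger: the function name tag_do_and_negation and the imperative flag are hypothetical, the input is assumed to be a single, already tokenized clause, and tokens the rule does not cover are left untagged (None).

```python
def tag_do_and_negation(tokens, imperative):
    """Tag 'do', 'not' and "n't" in one tokenized clause.

    Returns (token, tag) pairs; the tag is None for tokens this rule
    does not cover. `imperative` is supplied by the caller (assumption).
    """
    tagged = []
    seen_do = False
    for tok in tokens:
        low = tok.lower().replace("\u2019", "'")  # normalize curly apostrophes
        if low in ("n't", "not"):
            tag = "RB"  # negation is adverbial
        elif low == "do":
            # The first 'do' of a present-tense clause is the finite form (VVP);
            # a later 'do', or any 'do' in an imperative, is a base form (VV).
            tag = "VV" if (imperative or seen_do) else "VVP"
            seen_do = True
        else:
            tag = None
        tagged.append((tok, tag))
    return tagged


present = tag_do_and_negation(["I", "do", "n't", "do", "X"], imperative=False)
command = tag_do_and_negation(["Do", "n't", "go", "!"], imperative=True)
```

Here present pairs the first ‘do’ with VVP, “n't” with RB and the second ‘do’ with VV, while command pairs ‘Do’ with VV.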

Proper Nouns and Titles

Titles of books, films, etc.: tokens are tagged NP or NPS if they are capitalized, but function words are tagged as usual. So for Starship Troopers, both words are considered ‘proper’ and tagged: Starship = NP and Troopers = NPS. But for “Beauty and the Beast” we get: NP, CC, DT, NP.
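A minimal sketch of this rule, under stated assumptions: the function tag_title, the FUNCTION_WORD_TAGS lookup and the plural_tokens argument are all hypothetical names introduced for illustration. The caller supplies the normal tags of function words and says which capitalized tokens are plural, since neither can be read off the title itself.

```python
# Hypothetical lookup: the usual tags of the function words in the examples.
FUNCTION_WORD_TAGS = {"and": "CC", "the": "DT"}


def tag_title(tokens, function_word_tags, plural_tokens=frozenset()):
    """Tag the tokens of a title; returns one tag (or None) per token."""
    tags = []
    for tok in tokens:
        low = tok.lower()
        if low in function_word_tags:
            # Function words keep their normal tag, even when capitalized.
            tags.append(function_word_tags[low])
        elif tok[:1].isupper():
            # Capitalized content words are proper: NPS if plural, else NP.
            tags.append("NPS" if tok in plural_tokens else "NP")
        else:
            tags.append(None)  # not covered by this rule
    return tags


troopers = tag_title(["Starship", "Troopers"], FUNCTION_WORD_TAGS,
                     plural_tokens={"Troopers"})
beauty = tag_title(["Beauty", "and", "the", "Beast"], FUNCTION_WORD_TAGS)
```

With these inputs, troopers is NP, NPS and beauty is NP, CC, DT, NP, matching the two examples above.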

Comparatives with more / less

In cases like “more interesting”, we have two tokens - ‘more’ itself is tagged JJR, but 'interesting' is still just a normal JJ. If you're counting comparatives in the corpus, counting JJR still gets you


No verb should have -ing in its lemma. However, nouns ending with -ing should keep the -ing in their lemma. Some words can be both nouns and verbs; categorize them based on the specific instance.

  • writing/VVG beautifully has the lemma write
  • beautiful writing/NN has the lemma writing
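One way to check annotations against this rule is sketched below. The function check_ing_lemma is hypothetical, and the heuristic is an assumption: a VVG token is flagged only when its lemma is identical to the surface form, so verbs whose stem happens to end in -ing (sing, bring) are not flagged.

```python
def check_ing_lemma(token, tag, lemma):
    """Return a problem description, or None if the lemma looks consistent."""
    low, lem = token.lower(), lemma.lower()
    if tag == "VVG" and lem.endswith("ing") and lem == low:
        # The verb lemma is the unchanged -ing form, e.g. writing -> writing.
        return "verb lemma keeps the -ing suffix"
    if tag.startswith("NN") and low.rstrip("s").endswith("ing") \
            and not lem.endswith("ing"):
        # A noun in -ing (or plural -ings) whose lemma dropped the -ing.
        return "noun lemma lost its -ing"
    return None


ok_verb = check_ing_lemma("writing", "VVG", "write")
ok_noun = check_ing_lemma("writing", "NN", "writing")
bad_verb = check_ing_lemma("writing", "VVG", "writing")
bad_noun = check_ing_lemma("writing", "NN", "write")
```

The first two calls return None; the last two return a description of the problem. The check is per instance, so the same word form can pass as a noun and fail as a verb.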

Lexicalized words

If a multi-word construction has been lexicalized into one word (e.g. rapidly-growing rather than rapidly growing), then it must be treated as a lexicalized adjective or noun rather than a verb. Most often, these become JJs, such as

  • a rapidly-growing/JJ plant

Lexicalized nouns exist too, like

  • the constant egg-laying/NN

The lemmas of these words keep the gerund, i.e. egg-laying and not *egg-lay.
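The two constraints on lexicalized forms can be checked in the same spirit. This is a rough sketch: check_lexicalized_ing is a hypothetical helper, and it assumes that lexicalized forms are recognizable as hyphenated tokens ending in -ing, which will not hold for every case.

```python
def check_lexicalized_ing(token, tag, lemma):
    """Return a list of problems for a hyphenated -ing token (empty if none)."""
    low = token.lower()
    if "-" not in low or not low.endswith("ing"):
        return []  # not a hyphenated -ing form; rule does not apply
    problems = []
    if tag.startswith("V"):
        problems.append("hyphenated -ing form tagged as a verb")
    if lemma.lower() != low:
        problems.append("lemma should keep the gerund")
    return problems


good = check_lexicalized_ing("rapidly-growing", "JJ", "rapidly-growing")
bad = check_lexicalized_ing("egg-laying", "NN", "egg-lay")
```

Here good is an empty list, while bad reports that the lemma should keep the gerund.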

