Tokenization and Sentence Annotation


Unit Symbols

  • Currency: tokenize units of currency apart, e.g. $350 is two tokens, $ and 350. This makes sense because it is read as two words, and the number 350 is functioning in its usual way, combining with $ to create a compositional phrase (350 dollars).
  • Temperature symbols: separate temperatures such as 35°C into three tokens (for example, 35°C → 35, °, and C on separate lines). Rationale: F and C are different compositional constructs and should be treated like currency symbols ($ is a different construct than £).
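
The two unit rules above can be sketched as a small splitting function. This is only an illustration, not the tooling used for the corpus: the function name, the patterns, and the choice of currency symbols ($, £, €) are assumptions made for the example, and the function expects one whitespace-delimited token at a time.

```python
import re

# Hypothetical patterns, written for this example only.
# A currency symbol directly followed by a number, e.g. $350
CURRENCY = re.compile(r"^([$£€])(\d+(?:[.,]\d+)*)$")
# A number, a degree sign, and a scale letter, e.g. 35°C
TEMPERATURE = re.compile(r"^(-?\d+(?:\.\d+)?)(°)([CF])$")

def split_unit_symbols(token):
    """Split a currency or temperature token; return other tokens unchanged."""
    for pattern in (CURRENCY, TEMPERATURE):
        match = pattern.match(token)
        if match:
            # Each captured group becomes its own token.
            return list(match.groups())
    return [token]
```

Because the patterns are anchored to the whole token, anything that is not a bare currency amount or temperature (for instance a hyphenated word) is returned as a single token.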


As a general rule, hyphenated words should be kept together. This is especially true of words that are determinative compounds, where the modifier cannot take a plural form and does not constitute an independent word. For example:

  • 10-year plan (10-year is one token: if 10 were modifying year as an independent word, we would see 'years')
  • one-liners (note the plural -s inflects the whole 'one-liner'; separating 'one' would imply there is a word 'liners', and a subtype of that is one-liners, but actually this is the plural of the noun 'one-liner')

The same logic applies to participles and their arguments, as well as 'self':

  • energy-based (1 token)
  • self-proclaimed (1 token)

Some exceptions to keeping hyphens together are spans of time, where the hyphen means from-to, and hyphens coordinating items on the same level (copulative, non-determinative compounds):

  • 10:00-12:00 (three tokens)
  • Bill Clinton-Al Gore relationship (5 tokens, otherwise we have a token 'Clinton-Al')
  • China-Russia (a copulative compound where both members have the same status, not a subtype of Russia in a determinative reading)
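
Of the hyphen rules above, only the time-span exception can be recognized from the shape of the token alone. The sketch below covers just that case, assuming clock times written as HH:MM; the function name and pattern are hypothetical. Copulative compounds such as China-Russia require a judgment about the relationship between the members and are deliberately left untouched here.

```python
import re

# Hypothetical pattern: two clock times joined by a hyphen or en dash.
TIME_SPAN = re.compile(r"^(\d{1,2}:\d{2})([-–])(\d{1,2}:\d{2})$")

def split_time_span(token):
    """Split a from-to time span into three tokens; keep everything else whole."""
    match = TIME_SPAN.match(token)
    if match:
        # start time, hyphen, end time
        return list(match.groups())
    # Default rule: hyphenated words stay together as one token.
    return [token]
```

With this default, determinative compounds like 10-year or energy-based come back as single tokens, which matches the general rule.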

URLs and symbols from the Web

Keep URLs together, even if they contain discernible words.

Plurals with apostrophes

Many dates are written as if they contained a genitive 's. These items should be treated as plurals, and thus as single tokens. For example:

  • 1600's (single token)
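
A minimal check for this case might look as follows. The pattern is an assumption for illustration: it treats any run of two to four digits followed by 's as a plural date, which is broader than the single example given above and would still need a human check for true genitives.

```python
import re

# Hypothetical pattern: digits followed by an apostrophe (straight or curly) and s.
APOSTROPHE_PLURAL = re.compile(r"^\d{2,4}['’]s$")

def is_apostrophe_plural(token):
    """Return True if the token looks like a plural date such as 1600's."""
    return bool(APOSTROPHE_PLURAL.match(token))
```

A tokenizer could consult this check before applying its usual genitive 's split, so that tokens like 1600's stay whole.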

Sentence Annotation


  • Full sentences are segmented using the <s> tag during XML markup (see TEI Markup)
  • The text is divided entirely into non-overlapping sentences, so that every token is part of exactly one sentence
  • No tokens are left outside of sentences, meaning that headings and image captions are also surrounded by <s> tags
  • It is possible for a caption to include multiple sentences, each enclosed in <s> tags
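
The requirement that no token is left outside a sentence can be checked mechanically. The following is a rough sketch using Python's standard XML parser; it assumes a well-formed document with a single root element and whitespace-separated tokens, and the function name is made up for the example. It does not check for nested <s> elements.

```python
import xml.etree.ElementTree as ET

def stray_tokens(xml_string):
    """Return tokens that are not enclosed in any <s> element."""
    root = ET.fromstring(xml_string)
    stray = []

    def visit(elem, inside_s):
        inside = inside_s or elem.tag == "s"
        # Text directly inside this element, before its first child.
        if not inside and elem.text:
            stray.extend(elem.text.split())
        for child in elem:
            visit(child, inside)
            # Text after a child's closing tag still belongs to this element.
            if not inside and child.tail:
                stray.extend(child.tail.split())

    visit(root, False)
    return stray
```

An empty result means every token sits inside some <s> element, including tokens in headings and captions.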

Sentence Types

Each sentence tag <s> receives a type attribute from the following list:

  • decl - declarative sentence (indicative)
  • imp - imperative
  • sub - subjunctive, including modals like would, could, but not indicative future 'will'
  • q - a polar, yes/no question
  • wh - a WH question (e.g. who, what, why, where, when, how)
  • inf - an independent infinitive-headed clause (e.g. 'To kill a mockingbird.', or 'How to dance.')
  • ger - an independent gerund-headed clause (e.g. 'Finding Nemo')
  • intj - an interjection utterance ('Yes.', 'Hello!', 'Um…')
  • frag - a fragment without a subject-predicate structure, lacking a finite verb, not covered by the above ('The End.', 'At home.')
  • other - a construction not covered by the above (e.g. nominal predication 'Nice, that!') or a coordination of two or more types above ('I'm done and you shut up now!' - decl + imp).

Exceptions and doubtful cases

In certain cases, what looks like a modal can actually be indicative, e.g. 'can' describing ability - this should be tagged as decl if it's simply a statement of fact:

  • <s type="decl">I ca n't swim</s>

A modal 'can' of potential, not ability, is tagged 'sub' (this is the more common case):

  • <s type="sub">You can find some in the supermarket</s> (not debating hearer's ability to do so, just saying this is a possible option).

Similarly, 'will' can be used in a non-indicative way, in which case the sentence is tagged 'sub':

  • <s type="sub">Boys will be boys</s> (i.e. they may well behave as boys; this is not an indicative future claiming some boys will in fact be boys)

When to use 'other'

The category 'other' is meant for two cases:

  • Completely different type of sentence which falls under no other type (e.g. 'method 2 to the rescue' - it's not a normal declarative, but it has a subject-predicate structure, so it's not a fragment)
  • Sentence containing two complete clauses of varying types (e.g. do it and I don’t care how! – imp + decl)

The 'other' category does not apply when there is a main clause of one type and a subordinate clause of a different type, e.g. “washing the dishes, John noticed the burglar” - in this case, we have a normal declarative clause that has a subordinate gerund. It is not a gerund type (“ger”), since there is really only one main matrix clause: the past tense one with “noticed”.

Prioritization when multiple types apply

There is a hierarchy among the sentence types that comes into play when a sentence fits two definitions. Specifically, being a question gets ‘first dibs’ on the sentence type. We might want to say that a sentence is both hypothetical and a question, for example “Would you do it if you could?”, but we only get one label, and whether or not something is a question is seen as more crucial, so this example gets the type “q” (yes/no question).

Unless you have mixed sentence types ('other'), priorities are:

  • wh beats q (if a question has a wh word, it is wh)
  • q beats anything else
  • frag beats intj (e.g. “yes, that book” is frag)
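
The priorities above can be written down as a small resolver. This is a sketch under assumptions: the function name is invented, the input is the set of types an annotator considers applicable to one non-mixed sentence, and mixed sentences ('other') are expected to be identified before this step. Combinations that the guidelines do not rank raise an error instead of guessing.

```python
# The ten type labels listed in the guidelines.
SENTENCE_TYPES = {"decl", "imp", "sub", "q", "wh", "inf", "ger", "intj", "frag", "other"}

def resolve_type(candidates):
    """Pick a single label from the types that fit one non-mixed sentence."""
    candidates = set(candidates)
    unknown = candidates - SENTENCE_TYPES
    if unknown:
        raise ValueError(f"unknown sentence type(s): {sorted(unknown)}")
    if "wh" in candidates:   # wh beats q
        return "wh"
    if "q" in candidates:    # q beats anything else
        return "q"
    if candidates == {"frag", "intj"}:   # frag beats intj
        return "frag"
    if len(candidates) == 1:
        return next(iter(candidates))
    # The guidelines give no ranking for this combination.
    raise ValueError(f"no priority defined for: {sorted(candidates)}")
```

For example, "Would you do it if you could?" fits both sub and q, and resolve_type({"sub", "q"}) returns "q".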
gum/tokenization_segmentation.txt · Last modified: 2016/10/09 21:07 by amir