Tokenization and Sentence Annotation


Unit Symbols

  • Currency: Tokenize currency symbols apart from the amount, e.g. $350 is two tokens, $ and 350. This makes sense because it is read as two words, and the number 350 functions in its usual way, combining with $ to form a compositional phrase (350 dollars).
  • Temperature symbols: Separate temperatures such as 35°C into three tokens (35, °, and C, each on its own line). Rationale: F and C are different compositional constructs and should be treated like currency symbols ($ is a different construct than £).
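
The two rules above can be sketched as a small splitting function. This is only an illustration, not the project's tokenizer: the name split_units, the regular expression, and the symbol inventory ($, £, € and C/F) are assumptions made for the example.

```python
import re

# Assumed inventories for illustration; the guidelines mention $, £ and °C/°F.
UNIT_RE = re.compile(
    r"(?P<cur>[$£€])(?P<amount>\d+(?:[.,]\d+)*)"   # e.g. $350
    r"|(?P<deg>\d+(?:\.\d+)?)°(?P<scale>[CF])"     # e.g. 35°C
)

def split_units(word):
    """Return the tokens for one whitespace-delimited word."""
    m = UNIT_RE.fullmatch(word)
    if m is None:
        return [word]                      # not a unit expression: leave as is
    if m.group("cur"):
        return [m.group("cur"), m.group("amount")]
    return [m.group("deg"), "°", m.group("scale")]

print(split_units("$350"))   # ['$', '350']
print(split_units("35°C"))   # ['35', '°', 'C']
```

Words that do not match either pattern are returned unchanged, so the function can be applied to every word without side effects.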


As a general rule, hyphenated words should be split, since they can often be spelled apart. For example:

  • 10-year plan (10 - year plan)
  • one-liners (one - liners)

The same logic applies to participles and their arguments, as well as to compounds with 'self':

  • energy-based (3 tokens)
  • self-proclaimed (3 tokens)

Hyphens are also split off in spans of time, where the hyphen means from-to, and where they coordinate items on the same level (copulative, non-determinative compounds):

  • 10:00-12:00 (three tokens)
  • Bill Clinton-Al Gore relationship (6 tokens, otherwise we have a token 'Clinton-Al')
  • China-Russia (a copulative compound where both members have the same status, not a subtype of Russia in a determinative reading)

Exceptions which should not be tokenized apart include:

  • Morphological prefixes (re-, pre-, sub-, anti-)
  • Technical identifiers with hyphens (Bus A-1)
  • Syllabification/pronunciation guides (“it's pronounced soo-per”)
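
A minimal sketch of the hyphen rule and two of its exceptions follows. The function name split_hyphens, the prefix list (limited to the four prefixes named above), and the pattern used for technical identifiers are assumptions for illustration. Pronunciation guides cannot be recognized from the string alone and would still need manual review.

```python
import re

# Only the prefixes named in the guidelines; a real list would be longer.
PREFIXES = {"re", "pre", "sub", "anti"}

def split_hyphens(word):
    """Split a word at hyphens, keeping each hyphen as its own token."""
    if "-" not in word:
        return [word]
    parts = word.split("-")
    # Exception: morphological prefixes stay attached (e.g. re-open).
    if parts[0].lower() in PREFIXES:
        return [word]
    # Exception: technical identifiers (assumed shape: capital letter, hyphen, digits).
    if re.fullmatch(r"[A-Z]-\d+", word):
        return [word]
    tokens = []
    for i, part in enumerate(parts):
        if i > 0:
            tokens.append("-")
        if part:
            tokens.append(part)
    return tokens

print(split_hyphens("10-year"))      # ['10', '-', 'year']
print(split_hyphens("10:00-12:00"))  # ['10:00', '-', '12:00']
print(split_hyphens("A-1"))          # ['A-1']
```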

URLs and symbols from the Web

Keep URLs together, even if they contain discernible words or hyphens.

Plurals with apostrophes

Many dates are written as if they contained a genitive 's. These items should be treated as plurals, and thus as single tokens. For example:

  • 1960s (single token)
  • 1600's (single token)

But if a year really does have a genitive 's in it, it should be tokenized separately:

  • 1969 's hit single , “ Space Oddity ” (two tokens)

Indicating original spacing around tokens spelled together

Items which originally were spelled together but which will be tokenized separately should be surrounded with the <w> tag to indicate that there was no space between them in the original text (unless original spacing is trivial to infer). For example:

  • We distinguish original “can not” from “cannot” by adding <w> around the latter (it’s two tokens either way)
  • We distinguish original “apples / oranges” from “apples/oranges” by adding <w> around the latter (it’s three tokens either way)
  • Contractions such as “didn't” and “I'm” are not surrounded by <w>, since it is trivial to infer that the two tokens (e.g. “did” and “n't”) were originally written without an intervening space.

The <w> tag is not used in cases of morphologically complex words which are analyzed as single tokens, such as:

  • “graveyard”
  • “granddaughter”
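
The decision of when to add <w> can be sketched as follows. This is a hedged illustration: the function name emit_tokens, the one-item-per-line output layout, and the clitic test used to approximate "trivial to infer" are assumptions, not part of the guidelines.

```python
def emit_tokens(original, tokens):
    """Return output lines for one original span and its tokens.

    original: the text as it appeared in the source (e.g. "cannot")
    tokens:   the tokens it was split into (e.g. ["can", "not"])
    """
    # Spelled together in the original but split into several tokens?
    fused = len(tokens) > 1 and not any(ch.isspace() for ch in original)
    # Assumed heuristic for "trivial to infer": clitics such as n't, 'm, 's.
    trivial = any(t == "n't" or t.startswith(("'", "’")) for t in tokens[1:])
    if fused and not trivial:
        return ["<w>", *tokens, "</w>"]
    return list(tokens)

print(emit_tokens("cannot", ["can", "not"]))   # ['<w>', 'can', 'not', '</w>']
print(emit_tokens("can not", ["can", "not"]))  # ['can', 'not']
print(emit_tokens("didn't", ["did", "n't"]))   # ['did', "n't"]
```

Single-token words such as “graveyard” pass through unchanged, since nothing was split.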

Sentence Annotation


  • Full sentences are segmented using the <s> tag during XML markup (see TEI Markup)
  • The text is divided entirely into non-overlapping sentences, so that every token is part of exactly one sentence
  • No tokens are left outside of sentences, meaning that headings and image captions are also surrounded by <s> tags
  • It is possible for a caption to include multiple sentences, each enclosed in <s> tags
  • If direct speech subordinates multiple sentences, up to two sentences are allowed within a single sentence tag together with the main clause containing the speech verb. More than two sentences in direct speech should all receive separate <s> tags. The following examples illustrate this:
    • <s>John said: “I've had it. I'm not doing this anymore.”</s>
    • <s>John said:</s><s>“I've had it.</s><s>I'm not doing this anymore.</s><s>I'm going home.”</s>
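
The direct speech rule can be expressed as a short function. The name wrap_direct_speech and its two arguments are hypothetical, and the sketch assumes the quoted material has already been split into sentences by hand.

```python
def wrap_direct_speech(frame, quoted):
    """Wrap a speech frame and its quoted sentences in <s> tags.

    frame:  the main clause containing the speech verb, e.g. "John said:"
    quoted: list of quoted sentences, in order
    """
    if len(quoted) <= 2:
        # Up to two quoted sentences share one <s> with the frame.
        return ["<s>" + " ".join([frame, *quoted]) + "</s>"]
    # More than two: the frame and every quoted sentence get their own <s>.
    return ["<s>" + part + "</s>" for part in [frame, *quoted]]

print(wrap_direct_speech("John said:", ["“I've had it.", "I'm not doing this anymore.”"]))
```

With three quoted sentences the function returns four separate <s> elements, matching the second example above.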

Sentence Types

Each sentence tag <s> receives a type attribute from the following list:

  • decl - declarative sentence (indicative)
  • imp - imperative
  • sub - subjunctive, including modals like would, could, but not indicative future 'will', and deontic 'have to'/'got to' (=must)
  • q - a polar, yes/no question
  • wh - a WH question (e.g. who, what, why, where, when, how)
  • inf - an independent infinitive-headed clause (e.g. 'To kill a mockingbird.', or 'How to dance.')
  • ger - an independent gerund-headed clause (e.g. 'Finding Nemo')
  • intj - an interjection utterance ('Yes.', 'Hello!', 'Um…')
  • frag - a fragment without a subject predicate structure, lacking a finite verb, not covered by the above ('The End.', 'At home.')
  • multiple - a coordination of two or more types above ('I'm done and you shut up now!' - decl + imp)
  • other - a construction not covered by the above (e.g. nominal predication 'Nice, that!')

Note that multiple takes priority over other (e.g. decl+other = multiple).
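
A small check that every <s> carries one of the listed types might look like the sketch below. It uses Python's standard xml.etree.ElementTree module; the function name invalid_s_types and the wrapping <text> element in the usage example are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The eleven type values listed above.
S_TYPES = {"decl", "imp", "sub", "q", "wh", "inf", "ger",
           "intj", "frag", "multiple", "other"}

def invalid_s_types(xml_text):
    """Return the type values (or None if missing) of all <s> elements
    whose type attribute is not in the list, in document order."""
    root = ET.fromstring(xml_text)
    bad = []
    for s in root.iter("s"):
        s_type = s.get("type")
        if s_type not in S_TYPES:
            bad.append(s_type)
    return bad

doc = '<text><s type="decl">I ca n\'t swim</s><s type="quest">Why ?</s><s>The End .</s></text>'
print(invalid_s_types(doc))  # ['quest', None]
```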

Exceptions and doubtful cases

In certain cases, what looks like a modal can actually be indicative, e.g. 'can' describing ability - this should be tagged as decl if it's simply a statement of fact:

  • <s type="decl">I ca n't swim</s>

A modal 'can' of potential, not ability, is tagged 'sub' (this is the more common case):

  • <s type="sub">You can find some in the supermarket</s> (not debating hearer's ability to do so, just saying this is a possible option).

Similarly, 'will' can be used in a non-indicative way, in which case the sentence is tagged 'sub':

  • <s type="sub">Boys will be boys</s> (i.e. they may well behave as boys; this is not an indicative future claiming some boys will in fact be boys)
  • <s type="sub">I couldn't stand it if it spoke.</s> (i.e. whenever it might have spoken, I wouldn't have been able to stand it)

When to use 'multiple'

The category 'multiple' is meant for sentences containing two or more complete clauses of different types (e.g. 'Do it and I don’t care how!' - imp + decl).

The 'multiple' category does not apply when there is a main clause of one type and a subordinate clause of a different type, e.g. “washing the dishes, John noticed the burglar” - in this case, we have a normal declarative clause that has a subordinate gerund. It is not a gerund type (“ger”), since there is really only one main matrix clause: the past tense one with “noticed”.

The 'multiple' category also does not apply when parenthetical sentences are present; parenthetical sentences may be 'below the level' of the main clause, and so only the type of the main clause applies. For example, the following is a 'sub' type, notwithstanding the parenthetical clause set off by dashes ('and some were wrong'):

  • I would say only that if some of my judgments were wrong–and some were wrong–they were made in what I believed at the time to be the best interest of the Nation

Prioritization when multiple types apply

There is a hierarchy among the sentence types that sometimes comes into play when a sentence fits two definitions. Specifically, being a question gets ‘first dibs’ on the sentence type. We might have wanted to say about a sentence that it is both hypothetical and a question, for example: “Would you do it if you could?” But we only get one label, and whether or not something is a question is seen as more crucial, so this example gets the type “q” (yes/no question).

Unless you have mixed sentence types ('multiple'), priorities are:

  • wh beats q (if a question has a wh word, it is wh)
  • q beats anything else
  • frag beats intj (e.g. “yes, that book” is frag)
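
The priorities above can be written down as a small resolver for a single main clause that fits more than one definition. The function name resolve_s_type is hypothetical. Combinations for which no priority is stated here raise an error rather than guessing, and coordinated main clauses of different types are outside its scope because they receive 'multiple'.

```python
def resolve_s_type(candidates):
    """Pick one label when a single main clause fits several types."""
    candidates = set(candidates)
    if not candidates:
        raise ValueError("no candidate types given")
    if "wh" in candidates:          # wh beats q
        return "wh"
    if "q" in candidates:           # q beats anything else
        return "q"
    if candidates == {"frag", "intj"}:
        return "frag"               # frag beats intj
    if len(candidates) == 1:
        return next(iter(candidates))
    # The guidelines state no priority for other combinations.
    raise ValueError("no priority rule stated for: " + ", ".join(sorted(candidates)))

print(resolve_s_type({"sub", "q"}))      # q  ("Would you do it if you could?")
print(resolve_s_type({"frag", "intj"}))  # frag
```
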
gum/tokenization_segmentation.txt · Last modified: 2021/09/21 00:38 by nv214