Tokenization and Sentence Annotation

Tokenization

Unit Symbols

  • Currency: Tokenize currency symbols apart from the amount, e.g. $350 is two tokens, $ and 350. This makes sense because it is read as two words, and the number 350 functions in its usual way, combining with $ to form a compositional phrase (350 dollars).
  • Temperature symbols: Separate temperatures such as 35°C into three tokens (for example, 35°C -> 35, °, and C on separate lines). Rationale: F and C are different compositional constructs and should be treated similarly to currency ($ is a different construct than £).
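
The two unit rules above can be sketched as a small splitting function. This is an illustrative sketch only, not part of the corpus tooling: the function name `split_units`, the symbol inventory ($, £, C, F), and the accepted number format are assumptions made for the example.

```python
import re

# Assumed inventories for illustration: the guidelines mention $, £, C and F.
CURRENCY_RE = re.compile(r"^([$£])(\d+(?:[.,]\d+)*)$")
TEMPERATURE_RE = re.compile(r"^(\d+(?:\.\d+)?)(°)([CF])$")

def split_units(chunk):
    """Split a whitespace-free chunk such as '$350' or '35°C' into tokens.

    Chunks that match neither pattern are returned unchanged as one token.
    """
    for pattern in (CURRENCY_RE, TEMPERATURE_RE):
        match = pattern.match(chunk)
        if match:
            return list(match.groups())
    return [chunk]

print(split_units("$350"))   # currency symbol and amount as separate tokens
print(split_units("35°C"))   # number, degree sign and scale letter
```

Anything the patterns do not recognize is left for the other rules or for manual correction.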

Hyphenation

As a general rule, hyphenated words should be kept together. This is especially true of words that are determinative compounds, where the modifier cannot take a plural form and does not constitute an independent word. For example:

  • 10-year plan (10-year is one token: if 10 were modifying year as an independent word, we would see 'years')
  • one-liners (note the plural -s inflects the whole 'one-liner'; separating 'one' would imply there is a word 'liners', and a subtype of that is one-liners, but actually this is the plural of the noun 'one-liner')

The same logic applies to participles with their arguments, as well as to compounds with 'self':

  • energy-based (1 token)
  • self-proclaimed (1 token)

Some exceptions to keeping hyphens together are spans of time, where the hyphen means from-to, and hyphens coordinating items on the same level (copulative, non-determinative compounds):

  • 10:00-12:00 (three tokens)
  • Bill Clinton-Al Gore relationship (6 tokens, otherwise we have a token 'Clinton-Al')
  • China-Russia (a copulative compound where both members have the same status, not a subtype of Russia in a determinative reading)
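
The hyphenation rules could be approximated as follows. This is a rough sketch under stated assumptions: time ranges are recognized by a simple HH:MM pattern, and copulative compounds cannot be detected automatically here, so they are supplied as a hand-maintained set (the parameter `copulative` and the function name `split_hyphen` are hypothetical).

```python
import re

# Assumed shape of a time range for this example: H:MM-H:MM or HH:MM-HH:MM.
TIME_RANGE_RE = re.compile(r"^(\d{1,2}:\d{2})(-)(\d{1,2}:\d{2})$")

def split_hyphen(chunk, copulative=frozenset()):
    """Keep hyphenated chunks together unless they are a known exception.

    `copulative` holds chunks that an annotator has judged to be
    same-level coordinations (for example 'China-Russia').
    """
    match = TIME_RANGE_RE.match(chunk)
    if match:
        return list(match.groups())
    if chunk in copulative:
        left, right = chunk.split("-", 1)
        return [left, "-", right]
    # Default: determinative compounds such as '10-year' stay one token.
    return [chunk]

print(split_hyphen("10:00-12:00"))
print(split_hyphen("Clinton-Al", copulative={"Clinton-Al"}))
print(split_hyphen("energy-based"))
```

Deciding whether a compound is determinative or copulative remains a linguistic judgment, which is why the sketch takes that decision as input rather than guessing.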

URLs and symbols from the Web

Keep URLs together as single tokens, even if they contain discernible words.
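
One way to honor this rule in a tokenization script is to test each whitespace-delimited chunk for a URL shape before any other splitting is applied. The sketch below is an assumption-laden illustration: the URL pattern (a chunk starting with http://, https:// or www.) and the names `protect_urls` and `splitter` are invented for the example, and trailing punctuation attached to a URL is not handled.

```python
import re

# Assumed, deliberately simple URL shape for illustration.
URL_RE = re.compile(r"^(?:https?://|www\.)\S+$", re.IGNORECASE)

def protect_urls(chunks, splitter):
    """Apply `splitter` to every chunk except those that look like URLs."""
    tokens = []
    for chunk in chunks:
        if URL_RE.match(chunk):
            tokens.append(chunk)            # keep the URL as one token
        else:
            tokens.extend(splitter(chunk))  # normal tokenization rules
    return tokens
```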

Plurals with apostrophes

Many dates are written as if they contained a genitive 's. These items should be treated as plurals, and thus as single tokens. For example:

  • 1600's (single token)
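
A sketch of how a script might keep such date plurals together is shown below. It assumes, as the wording above implies but does not spell out, that a true genitive 's is otherwise split off as its own token; the function name `split_apostrophe_s` is hypothetical.

```python
import re

# A run of digits followed by 's (straight or curly apostrophe).
DATE_PLURAL_RE = re.compile(r"^\d+['’]s$")
GENITIVE_RE = re.compile(r"^(.+)(['’]s)$")

def split_apostrophe_s(chunk):
    """Keep date plurals such as 1600's whole; split other 's endings."""
    if DATE_PLURAL_RE.match(chunk):
        return [chunk]
    match = GENITIVE_RE.match(chunk)
    if match:
        # Assumption: genitive 's becomes a separate token.
        return [match.group(1), match.group(2)]
    return [chunk]
```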

Indicating original spacing around tokens spelled together

Items which originally were spelled together but which will be tokenized separately should be surrounded with the <w> tag to indicate that there was no space between them in the original text (unless original spacing is trivial to infer). For example:

  • We distinguish original “can not” from “cannot” by adding <w> around the latter (it’s two tokens either way)
  • We distinguish original “apples / oranges” from “apples/oranges” by adding <w> around the latter (it’s three tokens either way)
  • Contractions such as “didn't” do not get surrounded by <w>, as it is trivial to infer that the two tokens (i.e. “did” and “n't”) were originally written without an intervening space.

The <w> tag is not used in cases of morphologically complex words which are analyzed as single tokens, such as:

  • “graveyard”
  • “granddaughter”
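
The <w> convention can be illustrated with a small rendering helper. This sketch assumes a one-token-per-line output format with the <w> and </w> tags on their own lines; that layout, the function name `render_chunk`, and the `trivial` flag (set by the caller for cases like contractions) are assumptions for the example.

```python
def render_chunk(tokens, trivial=False):
    """Return output lines for the tokens of one originally unspaced chunk.

    tokens:  the tokens produced from a single whitespace-free chunk
    trivial: True when the original spacing is trivial to infer
             (for example 'did' + "n't"), so no <w> is needed
    """
    if len(tokens) > 1 and not trivial:
        return ["<w>", *tokens, "</w>"]
    # Single tokens (including complex words like 'graveyard') and
    # trivially recoverable splits are emitted without a wrapper.
    return list(tokens)

print("\n".join(render_chunk(["apples", "/", "oranges"])))
```

When the original text already had spaces (“can not”), each word is its own chunk, so no wrapper is produced.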

Sentence Annotation

Segmentation

  • Full sentences are segmented using the <s> tag during XML markup (see TEI Markup)
  • The text is divided entirely into non-overlapping sentences, so that every token is part of exactly one sentence
  • No tokens are left outside of sentences, meaning that headings and image captions are also surrounded by <s> tags
  • It is possible for a caption to include multiple sentences, each enclosed in <s> tags
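
The segmentation constraints above lend themselves to an automatic check. The following sketch assumes a one-token-per-line file in which each tag sits on its own line; that layout and the function name `tokens_outside_sentences` are assumptions for illustration, not a description of the project's actual validation tools.

```python
import re

# An opening <s> tag, with or without attributes, alone on its line.
S_OPEN_RE = re.compile(r"^<s(\s[^>]*)?>$")

def tokens_outside_sentences(lines):
    """Return (line_number, token) pairs for tokens not inside any <s>.

    Raises ValueError if <s> elements are nested or closed without being
    opened, since sentences must not overlap.
    """
    inside = False
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        if S_OPEN_RE.match(line):
            if inside:
                raise ValueError(f"nested <s> at line {number}")
            inside = True
        elif line == "</s>":
            if not inside:
                raise ValueError(f"unmatched </s> at line {number}")
            inside = False
        elif line.startswith("<") and line.endswith(">"):
            continue  # some other markup element, not a token
        elif not inside:
            problems.append((number, line))
    return problems
```

An empty result means every token belongs to exactly one sentence, including tokens in headings and captions.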

Sentence Types

Each sentence tag <s> receives a type attribute from the following list:

  • decl - declarative sentence (indicative)
  • imp - imperative
  • sub - subjunctive, including modals like would, could, but not indicative future 'will'
  • q - a polar, yes/no question
  • wh - a WH question (e.g. who, what, why, where, when, how)
  • inf - an independent infinitive-headed clause (e.g. 'To kill a mockingbird.', or 'How to dance.')
  • ger - an independent gerund-headed clause (e.g. 'Finding Nemo')
  • intj - an interjection utterance ('Yes.', 'Hello!', 'Um…')
  • frag - a fragment without a subject-predicate structure, lacking a finite verb, not covered by the above ('The End.', 'At home.')
  • other - a construction not covered by the above (e.g. nominal predication 'Nice, that!') or a coordination of two or more types above ('I'm done and you shut up now!' - decl + imp).
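
Because the type inventory is closed, the attribute values can be checked mechanically. The sketch below assumes attributes are written with straight double quotes, as in well-formed XML; the function name `invalid_sentence_tags` is hypothetical.

```python
import re

# The ten sentence types listed above.
S_TYPES = {"decl", "imp", "sub", "q", "wh", "inf", "ger", "intj", "frag", "other"}

S_TAG_RE = re.compile(r"<s\b([^>]*)>")
TYPE_ATTR_RE = re.compile(r'\btype="([^"]*)"')

def invalid_sentence_tags(text):
    """Return every opening <s> tag whose type is missing or not in S_TYPES."""
    bad = []
    for tag in S_TAG_RE.finditer(text):
        attr = TYPE_ATTR_RE.search(tag.group(1))
        if attr is None or attr.group(1) not in S_TYPES:
            bad.append(tag.group(0))
    return bad
```

A tag written with typographic quotes would also be reported, since its type attribute cannot be read as XML.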

Exceptions and doubtful cases

In certain cases, what looks like a modal can actually be indicative, e.g. 'can' describing ability - this should be tagged as decl if it's simply a statement of fact:

  • <s type="decl">I ca n't swim</s>

A modal 'can' of potential, not ability, is tagged 'sub' (this is the more common case):

  • <s type="sub">You can find some in the supermarket</s> (not debating hearer's ability to do so, just saying this is a possible option).

Similarly, 'will' can be used in a non-indicative way, in which case the sentence is tagged 'sub':

  • <s type="sub">Boys will be boys</s> (i.e. they may well behave as boys; this is not an indicative future claiming some boys will in fact be boys)
  • <s type="sub">I couldn't stand it if it spoke.</s> (i.e. whenever it might have spoken, I wouldn't have been able to stand it)

When to use 'other'

The category 'other' is meant for two cases:

  • A completely different type of sentence which falls under no other type (e.g. 'method 2 to the rescue' - it's not a normal declarative, but it has a subject-predicate structure, so it's not a fragment)
  • A sentence containing two complete clauses of different types (e.g. 'do it and I don’t care how!' - imp + decl)

The 'other' category does not apply when there is a main clause of one type and a subordinate clause of a different type, e.g. “washing the dishes, John noticed the burglar” - in this case, we have a normal declarative clause that has a subordinate gerund. It is not a gerund type (“ger”), since there is really only one main matrix clause: the past tense one with “noticed”.

Prioritization when multiple types apply

There is a hierarchy among the sentence types that comes into play when a sentence fits two definitions. Specifically, being a question gets ‘first dibs’ on the sentence type. We might have wanted to say that a sentence is both hypothetical and a question, for example: “Would you do it if you could?” But we only get one label, and whether or not something is a question is seen as more crucial, so this example gets the type “q” (yes/no question).

Unless you have mixed sentence types ('other'), priorities are:

  • wh beats q (if a question has a wh word, it is wh)
  • q beats anything else
  • frag beats intj (e.g. “yes, that book” is frag)
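
The priorities above can be written down as a small resolution function. This is a sketch only: it assumes the sentence has a single main clause (so the mixed-clause 'other' case does not arise) and that the caller supplies the set of types that seem to fit. The function name `resolve_type` is hypothetical, and combinations the guidelines do not rank are reported rather than guessed.

```python
def resolve_type(candidates):
    """Pick one sentence type from the types that appear to apply."""
    types = set(candidates)
    if len(types) == 1:
        return next(iter(types))
    if "wh" in types:
        return "wh"    # wh beats q, and questions beat everything else
    if "q" in types:
        return "q"     # q beats any non-question type
    if types == {"frag", "intj"}:
        return "frag"  # frag beats intj
    # No documented rule covers this combination: leave it to the annotator.
    raise ValueError(f"no documented priority for {sorted(types)}")

print(resolve_type(["sub", "q"]))  # the "Would you do it if you could?" case
```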
gum/tokenization_segmentation.txt · Last modified: 2017/09/22 10:22 by zw85