Word segmentation

Word tokenization is not straightforward, and every other level of analysis depends on how a given corpus is segmented into words. The word tokenization used for LCDC closely follows that of the Penn Treebank Corpus and most other corpora. White space separates tokens.

All non-letter, non-numeric characters should be segmented as their own tokens apart from the surrounding tokens, with the following exceptions:

  • Possessives: The singular possessive 's is split from the preceding word with the apostrophe intact. The plural possessive ' is likewise split from the preceding word.
  • Compounds: When compounds are formed with a dash in the text, this dash is kept in place and the compound word is kept together.
  • Contractions: The contracted form is split from the preceding word with the apostrophe intact. The exception is the second-person plural form y’all, which is kept intact.
  • Numbers: When punctuation is used in a cardinal number, the punctuation is kept in place and the number is kept as a single token.
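
The rules above can be sketched as a single regular expression. This is only an illustrative sketch, not the tool used to build the corpus. The name tokenize, the clitic inventory, the Penn Treebank style handling of n't (do + n't), the restriction of number-internal punctuation to "." and ",", and the ASCII-only letter classes are all assumptions that the guidelines do not state.

```python
import re
from typing import List

# Hypothetical clitic inventory; the guidelines do not enumerate one.
# Treating "n't" as a clitic follows Penn Treebank practice (assumption).
CLITIC = r"(?:n['’]t|['’](?:s|re|ve|ll|d|m))"

TOKEN_RE = re.compile(
    rf"""
    \by['’]all\b                          # exception: y'all is kept intact
  | \d+(?:[.,]\d+)*                       # numbers keep internal punctuation
  | {CLITIC}(?![A-Za-z])                  # contraction or singular possessive
  | [A-Za-z]+?(?={CLITIC}(?![A-Za-z]))    # word stem directly before a clitic
  | [A-Za-z]+(?:-[A-Za-z]+)*              # plain word or dash-joined compound
  | [^\sA-Za-z0-9]                        # any other symbol is its own token
    """,
    re.VERBOSE | re.IGNORECASE,
)


def tokenize(text: str) -> List[str]:
    """Split text into tokens; white space only separates tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("Y'all don't know John's well-known dogs' names."))
# ["Y'all", 'do', "n't", 'know', 'John', "'s", 'well-known', 'dogs', "'", 'names', '.']
print(tokenize("It cost $1,000.50!"))
# ['It', 'cost', '$', '1,000.50', '!']
```

The order of the alternatives matters: the y'all exception and the number pattern are tried before the general rules, so they take priority over the default of splitting off every non-letter, non-numeric character.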

One notable divergence from many non-spoken corpora is that the text that makes up the corpus is a transcribed representation of audio. As a result, the corpus contains conventionalized spellings of tokens that stem historically from multiple-word expressions. These tokens have undergone phonological reduction followed by reanalysis as a single unit. The following appear fairly regularly in the corpus:

gotta, kinda, gonna, wanna, etc.

Each of these should be treated as a single token. This revises earlier coding guidelines, which decomposed them into two tokens based on the uncontracted forms from which they were derived either historically or synchronically (for the examples above: got to, kind of, going to, want to, respectively).

When these contracted forms show up as even more reduced items (e.g., gone for gonna in Where you gone meet?), they should be treated in the same manner as above, that is, as a single token. To test for phonological reduction in a pair such as gonna/gone, mentally substitute gonna wherever gone is encountered in the text and check that the two actually serve the same function.

The one exception to this guideline is Imma, a contracted form stemming from I’m going to. Since this form contains the subject pronoun in addition to verbal material, the subject I must be split off and tokenized separately, while, for consistency with the other contracted auxiliaries mentioned above, mma is kept together. An example is given below:

Imma switch places with you, Mo. → I mma switch places with you , Mo .
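
The Imma rule can be applied as a post-processing pass over an already tokenized sentence. The sketch below is illustrative only: the function name split_imma is hypothetical, and matching case-insensitively (so that a lowercase imma is also split) is an assumption that the guidelines do not state.

```python
from typing import List


def split_imma(tokens: List[str]) -> List[str]:
    """Split Imma into the subject I and the remainder mma.

    Every other token, including reduced forms such as gonna,
    is passed through unchanged.
    """
    out: List[str] = []
    for tok in tokens:
        if tok.lower() == "imma":  # case-insensitive match is an assumption
            out.extend([tok[0], tok[1:]])  # "Imma" -> "I", "mma"
        else:
            out.append(tok)
    return out


tokens = ["Imma", "switch", "places", "with", "you", ",", "Mo", "."]
print(" ".join(split_imma(tokens)))
# I mma switch places with you , Mo .
```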

tokenization_-_segmentation_into_words.txt · Last modified: 2018/09/11 10:02 (external edit)