Word tokenization is not straightforward, and all other levels of analysis depend on how a given corpus is segmented into words. The word tokenization done on LCDC is very similar to that done on the Penn Treebank Corpus and most other corpora. White space separates tokens.
All non-letter, non-numeric characters should be segmented as their own tokens apart from the surrounding tokens, with the following exceptions:
One notable divergence from many non-spoken corpora is that the text that makes up the corpus is a transcribed representation of an audio file. As a result, the corpus contains conventionalized spellings of tokens that stem historically from multiple-word expressions and that have undergone phonological reduction followed by reanalysis as a single unit. The following such tokens appear fairly regularly in the corpus:
gotta, kinda, gonna, wanna, etc.
These should all be treated as single tokens. This revises earlier coding guidelines, which decomposed each of them into two tokens based on the uncontracted forms from which they derive, either historically or synchronically (for the examples above: got to, kind of, going to, and want to, respectively).
When these contracted forms show up as even more reduced items (e.g., Where you gone meet?), they should be treated in the same manner as above. Test for phonological reduction in a pair such as gonna/gone by mentally substituting gonna when gone is encountered in the text to make sure the two actually carry the same function.
The one exception to this guideline is Imma, a contracted form stemming from I’m going to. Since this form contains the subject pronoun in addition to verbal material, the subject I must be split off and tokenized separately, while, for consistency with the other contracted auxiliaries mentioned above, mma is kept together. An example is given below:
Imma switch places with you, Mo. → I mma switch places with you , Mo .
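The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool actually used on LCDC: the function name tokenize and the regular expression are hypothetical, and the exceptions to the punctuation rule are not implemented, so forms containing apostrophes or hyphens would be over-split. Reduced forms such as gotta, gonna, and gone need no special handling, since they contain only letters and so stay whole; only Imma is split.

```python
import re

# Hypothetical pattern: a run of letters/digits, or one single
# non-letter, non-numeric character (punctuation, symbol, underscore).
_PIECE_RE = re.compile(r"[^\W_]+|[^\w\s]|_")

def tokenize(text):
    """Sketch of the guideline: split on white space, separate
    punctuation into its own tokens, and split 'Imma' into 'I' + 'mma'.
    The guideline's punctuation exceptions are NOT handled here."""
    tokens = []
    for chunk in text.split():                 # white space separates tokens
        for piece in _PIECE_RE.findall(chunk):
            if piece.lower() == "imma":
                # Split off the subject pronoun; keep 'mma' together.
                tokens.extend([piece[0], piece[1:]])
            else:
                # gotta, kinda, gonna, wanna, gone, ... remain single tokens.
                tokens.append(piece)
    return tokens

print(" ".join(tokenize("Imma switch places with you, Mo.")))
```

Applied to the example sentence, the sketch yields the same token sequence as the guideline: I mma switch places with you , Mo .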