Chat Tokenization

Missing Apostrophes:

As expected with computer-mediated communication (CMC), standard rules of punctuation and capitalization are not consistently observed; for example, “im” is frequently used for “I'm”. Such a form should still be separated into two tokens. The lack of an apostrophe should not be taken as evidence of a single token and, although this is likely less common, the presence of one should not be the sole criterion for creating two tokens.

Although it is possible to compile a complete list of the non-apostrophe variants to aid in tokenization, this approach runs into ambiguity. “im” is an acceptable variant of “I'm”, but it can also be a verb meaning “instant message”; “ill” can mean “I'll” or “ill” [sick]. “It's” and “its” are also problematic: where the form means “it is”, it should be split into two tokens, while the possessive form should remain one unit. This will be addressed further in the section on POS tagging.

From the present corpus:

hi 31f ga im me to chat

Here “im” is the verb (“instant message me”) and should remain a single token, but automatic tokenization incorrectly rendered the sentence as:

hi 31f ga i m me to chat
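The list-based approach and its ambiguity problem can be sketched as follows. This is a minimal illustration, not the tokenizer used for the corpus: the table names, the sample entries "youre", "ive" and "thats", and the decision to leave ambiguous forms unsplit and flag them for later review are all assumptions made for the example.

```python
# Hypothetical lookup tables (illustrative entries only, not a complete list).
# Each value is the character position at which the form is split.
UNAMBIGUOUS_SPLITS = {"youre": 3, "ive": 1, "thats": 4}

# Forms named in the text as ambiguous: splitting depends on context.
AMBIGUOUS = {"im", "ill", "its"}


def tokenize_contractions(text):
    """Split known apostrophe-less contractions; flag ambiguous forms.

    Returns (tokens, flagged), where flagged holds the indices of tokens
    that were left whole because they need context to disambiguate.
    """
    tokens = []
    flagged = []
    for word in text.split():
        key = word.lower()
        if key in AMBIGUOUS:
            # Assumption: defer the decision (e.g. to POS tagging).
            flagged.append(len(tokens))
            tokens.append(word)
        elif key in UNAMBIGUOUS_SPLITS:
            cut = UNAMBIGUOUS_SPLITS[key]
            tokens.extend([word[:cut], word[cut:]])
        else:
            tokens.append(word)
    return tokens, flagged


tokens, flagged = tokenize_contractions("hi 31f ga im me to chat")
print(tokens)   # "im" is kept whole
print(flagged)  # index of the token that needs review
```

Deferring the ambiguous forms avoids the “i m” error shown above, at the cost of a second pass that decides each flagged token in context.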

Extra Punctuation:

Extra sentence-final punctuation is common in this chat corpus. It may take the form of a repeated punctuation mark, as in “???”, or a combination of different punctuation marks, as in “..?”. Ellipses may consist of the standard three periods or more, and may also be composed of commas. Multiple or extended sentence-final punctuation should be made one token, reflecting the fact that the sentence ends only once. A general rule of creating a single token out of multiple adjacent punctuation marks not separated by spaces will also help to preserve emoticons, with each emoticon being a single token.


Where did the 2 come from ???

This was originally tokenized as:

Where did the 2 come from ? ? ?
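The general rule of grouping adjacent punctuation marks can be sketched with a single regular expression. This is a minimal sketch using Python's standard `re` module; the pattern and the function name `tokenize_punct` are assumptions for illustration, and the sketch covers only punctuation runs (it does not handle apostrophes or emoticons that contain letters, such as “:D”).

```python
import re

# A token is either a run of word characters or a run of adjacent
# non-word, non-space characters (punctuation not separated by spaces).
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")


def tokenize_punct(text):
    """Return tokens, keeping each unbroken punctuation run as one token."""
    return TOKEN_RE.findall(text)


print(tokenize_punct("Where did the 2 come from ???"))
print(tokenize_punct("ok,,, see you :-)"))
```

Because the rule is purely positional, it treats “???”, “..?”, “,,,” and “:-)” alike: each unbroken run becomes one token.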


In spoken English, “wanna” is a common variant of “want to”. This form also appears in the chat corpus, sometimes spelled “wana”. Although the evidence is not firm that “wanna” is precisely equivalent to “want to”, it is split into two tokens here (“wan” “na”) so that variants of the “want to” construction can be compared more easily. The same rule should apply to similar constructions such as “gonna”.
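The split for “wanna” and “gonna” can be sketched as below. The function name `split_na` and the pattern are hypothetical. The guidelines do not say how the shorter spelling “wana” should be divided, so splitting it as “wan” + “a” is an assumption of this sketch.

```python
import re

# Stem "wan" or "gon" followed by "na" (or "a" for short spellings).
# Assumption: "wana" is split as "wan" + "a"; the text does not specify.
NA_RE = re.compile(r"(wan|gon)(n?a)", re.IGNORECASE)


def split_na(token):
    """Split wanna/gonna-style tokens in two; return other tokens unchanged."""
    match = NA_RE.fullmatch(token)
    if match:
        return [match.group(1), match.group(2)]
    return [token]


for word in ["wanna", "wana", "gonna", "banana"]:
    print(word, split_na(word))
```

Matching the whole token, rather than searching inside it, keeps words such as “banana” intact.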

Concatenated multiple parts of speech:

One interesting feature of this corpus is the use of the A/S/L (age/sex/location) identifier common to chat rooms involving strangers. Whether these are hyphenated (as in “27-m-canada”) or not (as in “31f”), they should be preserved as a single token. Their use in the present corpus suggests that they function as a single unit similar to PP (PRP) or NN. Preserving them as a single token has the additional benefit of making syntax annotation simpler.
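One way to protect A/S/L identifiers is to test each whitespace-delimited chunk against a pattern before any further splitting. The sketch below is illustrative only: the pattern is inferred from the two examples above (“27-m-canada” and “31f”), and the names `ASL_RE`, `FALLBACK_RE` and `tokenize` are hypothetical. Other A/S/L spellings in the corpus would need additional patterns.

```python
import re

# Assumed shape: 1-2 digit age, optional separator, m/f, optional location.
ASL_RE = re.compile(r"\d{1,2}[-/]?[mf](?:[-/][a-z]+)?", re.IGNORECASE)

# Fallback: words and unbroken punctuation runs as separate tokens.
FALLBACK_RE = re.compile(r"\w+|[^\w\s]+")


def tokenize(text):
    """Keep A/S/L identifiers whole; split everything else with the fallback."""
    tokens = []
    for chunk in text.split():
        if ASL_RE.fullmatch(chunk):
            tokens.append(chunk)
        else:
            tokens.extend(FALLBACK_RE.findall(chunk))
    return tokens


print(tokenize("hi 27-m-canada here"))
print(tokenize("31f ga ???"))
```

The pattern will also match strings such as “5m” that are not A/S/L identifiers, and it misses identifiers with attached punctuation, so the matches are best treated as candidates to be checked in context.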

chat_tokenization.txt · Last modified: 2021/02/11 16:44 (external edit)