
Chat part of speech tagging

New tags:

ACR - Acronyms typical of computer-mediated communication (CMC), such as “lol” or “brb”, should be tagged ACR.

EMO - Emoticons, such as “:)” or “:-o”, should be tagged EMO. Emojis should also receive the EMO tag.
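The two new tags lend themselves to a simple lexical pre-pass before the main tagger runs. The following is a minimal sketch, not part of the original guidelines: the acronym list, the emoticon pattern, the emoji code point ranges and the function names are all illustrative assumptions, and a real system would need far broader coverage.

```python
import re

# Hypothetical, deliberately tiny acronym lexicon (assumption, not from the corpus).
ACRONYMS = {"lol", "brb", "omg", "btw"}

# Rough pattern for Western-style emoticons: eyes, optional nose, mouth.
# Covers forms such as :) ;-) :-o =D but is far from exhaustive.
EMOTICON_RE = re.compile(r"^[:;=8][-o']?[)(DPpOo/|*\]\[]$")


def is_emoji_char(ch):
    """Approximate emoji test using two common Unicode ranges (assumption)."""
    cp = ord(ch)
    return 0x1F300 <= cp <= 0x1FAFF or 0x2600 <= cp <= 0x27BF


def pretag(token):
    """Return 'ACR', 'EMO', or None if the token is left to the main tagger."""
    if token.lower() in ACRONYMS:
        return "ACR"
    if EMOTICON_RE.match(token):
        return "EMO"
    # Tokens made up entirely of emoji characters also receive EMO.
    if token and all(is_emoji_char(c) for c in token):
        return "EMO"
    return None


if __name__ == "__main__":
    for tok in ["lol", ":)", ":-o", "\U0001F600", "hello"]:
        print(repr(tok), pretag(tok))
```

Emoji sequences that use variation selectors or zero-width joiners fall outside the two ranges checked here and would need additional handling.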

Old tags:

NP - Identifying proper nouns in CMC is somewhat problematic because they are not capitalized consistently. The following is a non-exhaustive list of proper noun types encountered in the present chat corpus: given names, user names, brand names, place names (including state abbreviations) and titles.

PRP/PP - In addition to lower case “i” used as a personal pronoun, other variants of personal pronouns, such as “ya”, were also observed. These should all continue to receive the PRP or PP tag, depending on the tag set being used.

SYM - This tag is distinct from EMO and is used for any non-emoticon combination of punctuation and symbols, such as “-->” to indicate “pointing” at a chat room participant. The use of this tag may have to be reconsidered for cases where the symbol is integrated into the grammatical structure, or for more complex pictures drawn with keystrokes.

UH - Consistent with the original PTB tags, variants of “yes” (e.g., “yeah”), “no” (e.g., “nope”), “hello” (e.g., “hiya”), etc. should also be tagged UH. “haha” (and similar), as well as curse words and onomatopoeia (e.g., “tick tock”), should receive this tag. “Waddup”, “hiya” and other single-token greetings originally derived from multiple words should be tagged UH.
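The PRP/PP and UH rules above can be approximated with a small lookup applied before or after an off-the-shelf tagger. This is only a sketch under stated assumptions: the word lists contain just the examples named in these guidelines, the laughter pattern is a guess at what “haha (and similar)” covers, and `override_tag` is a hypothetical helper name.

```python
import re

# Examples taken from the guidelines; a real lexicon would be much larger.
UH_WORDS = {"yeah", "nope", "hiya", "waddup"}
PRONOUN_VARIANTS = {"i", "ya"}

# Assumed pattern for laughter tokens such as "haha", "hahaha", "hahah".
LAUGHTER_RE = re.compile(r"(?:ha){2,}h?")


def override_tag(token, pronoun_tag="PRP"):
    """Return a tag for known chat variants, or None to defer to the tagger.

    pronoun_tag lets the caller choose PRP or PP, depending on the tag set.
    """
    lowered = token.lower()
    if lowered in PRONOUN_VARIANTS:
        return pronoun_tag
    if lowered in UH_WORDS or LAUGHTER_RE.fullmatch(lowered):
        return "UH"
    return None


if __name__ == "__main__":
    for tok in ["i", "ya", "Yeah", "hahaha", "boat"]:
        print(tok, override_tag(tok))
```

A context-free lookup like this cannot resolve tokens that are ambiguous between several readings, so its output should still be reviewed by an annotator.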

Problematic cases:

The relaxed spelling and grammar of the online environment lead to some complications with POS tagging. As mentioned in the tokenization section, pairs such as “its”/“it's” and “your”/“you're” present issues. In guidelines for POS tagging standard texts, the rule is to tag an element as it appears, not as what is grammatically correct or was intended. Applying that rule to chat text, however, would wrongly dismiss these instances as errors. Thus, these guidelines for chat text depart from the previously established “tag it as it lay” approach.

In some cases, the tokenization and POS tagging are more evident than in others. A typical clear example missing an apostrophe (not from the present corpus):

it [enter] s good

This form likely resulted from an error: the return key was hit in lieu of the apostrophe. However, this example does give us the desired split between “it” and “s” that allows for easy tokenization and POS tagging.

A typical case showing no split on the part of the user between the PRP and the verb, taken from the present corpus:

its a family tradition

In this case, “its” must be split into two tokens so that the “s” can be correctly labeled as VBZ:

it s a family tradition
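The split shown above can be represented as a list of (token, tag) pairs. The sketch below assumes that an annotator or an upstream rule has already decided that the “its” at a given position stands for “it's”; the function name `split_its` and the use of `None` for tokens left to the main tagger are illustrative choices, not part of the guidelines.

```python
def split_its(tokens, index):
    """Split tokens[index] ("its") into "it" + "s", tagged PRP and VBZ.

    All other tokens are returned with tag None, meaning the main tagger
    still has to label them.
    """
    word = tokens[index]
    if word.lower() != "its":
        raise ValueError(f"expected 'its' at position {index}, got {word!r}")
    tagged = [(tok, None) for tok in tokens]
    # Replace the single token with two tagged tokens, preserving the casing.
    tagged[index:index + 1] = [(word[:2], "PRP"), (word[2:], "VBZ")]
    return tagged


if __name__ == "__main__":
    print(split_its("its a family tradition".split(), 0))
```

Because “its” is also a valid possessive form, the split should only be applied where the context supports the “it's” reading.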

Because many of these unsplit forms are confusable with another valid form, automatic tagging of chat text should be carefully checked for these errors. The context that follows the token provides extremely useful information. In the example below, “your” most likely stands for “you're”, based on the following word:

your gonna need a bigger boat

Here, “gonna” indicates that “your” should be split into two tokens and tagged PRP, VBP.
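A following-word cue of this kind could be checked automatically to flag candidates for review. The sketch below is an assumption-laden illustration: the cue set contains only “gonna” from the example, the split surface forms “you” + “r” are chosen by analogy with “it” + “s” and are not prescribed by these guidelines, and `retag_your` is a hypothetical helper name.

```python
# Words that, when they directly follow "your", suggest the "you're" reading.
# Only "gonna" comes from the guidelines; extend after reviewing the corpus.
SPLIT_CUES = {"gonna"}


def retag_your(tokens):
    """Split "your" into PRP + VBP when the next token is a known cue.

    Returns (token, tag) pairs; tag None means "leave to the main tagger".
    """
    out = []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else None
        if tok.lower() == "your" and nxt in SPLIT_CUES:
            out.append((tok[:3], "PRP"))  # "you"
            out.append((tok[3:], "VBP"))  # "r" (assumed surface form)
        else:
            out.append((tok, None))
    return out


if __name__ == "__main__":
    print(retag_your("your gonna need a bigger boat".split()))
    print(retag_your("your boat".split()))
```

Without a cue word, “your” is left untouched, which keeps genuine possessives such as “your boat” from being split.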

chat_part_of_speech_tagging.txt · Last modified: 2018/09/11 10:02 (external edit)