Twitter

Preamble

This is a guide to annotating “microblog” entries (i.e. tweets) from Twitter.

Annotation Guidelines

Tokenization

* is always a token.

“…” is one token.

“dessert/espresso” should be tokenized as “dessert” “/” “espresso” (three tokens).
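The three rules above can be sketched as a small regular-expression tokenizer. This is only an illustration, not the tool used for the annotation: the pattern `TOKEN_RE` and the function `tokenize` are hypothetical names, and keeping URLs, hashtags, and at-mentions whole is an assumption based on how they are tagged later in this guide.

```python
import re

# Hypothetical pattern. Alternatives are tried left to right, so URLs come
# first to keep their slashes from being split off as separate tokens.
TOKEN_RE = re.compile(
    r"https?://\S+"      # assumption: a URL stays one token
    r"|[#@][\w&]+"       # assumption: hashtags and at-mentions stay whole,
                         # including "&" so "#Brownie&Cream" is not split
    r"|\.\.\.|…"         # an ellipsis is one token
    r"|\w+(?:'\w+)?"     # a word, with an optional apostrophe suffix
    r"|\S"               # any other single character, e.g. "*" or "/"
)

def tokenize(text):
    """Return the tokens of `text` in order, skipping whitespace."""
    return TOKEN_RE.findall(text)

print(tokenize("dessert/espresso"))
print(tokenize("*sigh* …"))
```

The catch-all `\S` branch is what makes `*` and `/` come out as tokens of their own; everything more specific has to be listed before it.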

Utterance Segmentation

Most tweets are in and of themselves utterances. NO

Danglies

Tweets often have a mixture of URLs and hashtags dangling at the end. Dangling URLs and dangling strings of hashtags and at-mentions are each treated as separate utterances.

Example:

#Coffee Instagram by @kmarrero7 Iced coffee Starbucks newbie #forgotthemilk #dark #espresso #palpiations #ineedanek … http://t.co/pT42jAdnmQ

[#Coffee Instagram by @kmarrero7] [Iced coffee Starbucks newbie] [#forgotthemilk #dark #espresso #palpiations #ineedanek …] [http://t.co/pT42jAdnmQ]
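As a rough sketch of the dangly rule, the trailing URL and the trailing run of hashtags can be peeled off mechanically by scanning whitespace-separated tokens from the end of the tweet. The function name `split_danglies` and its helpers are hypothetical, and attaching a trailing “…” to the hashtag run is an assumption taken from the bracketing above. The further split of the tweet body into two utterances is an annotator judgment that this sketch does not attempt.

```python
ELLIPSES = {"…", "..."}

def is_url(token):
    return token.startswith(("http://", "https://"))

def is_tag(token):
    # A hashtag or at-mention: "#" or "@" followed by at least one character.
    return len(token) > 1 and token[0] in "#@"

def split_danglies(tweet):
    """Split a tweet into [body, dangling segments...] in original order."""
    tokens = tweet.split()
    tail = []              # dangling segments, collected right to left
    end = len(tokens)
    while end > 0:
        if is_url(tokens[end - 1]):
            start = end - 1                    # each dangling URL stands alone
        else:
            start = end
            # Extend leftwards over hashtags, at-mentions, and ellipses.
            while start > 0 and (is_tag(tokens[start - 1])
                                 or tokens[start - 1] in ELLIPSES):
                start -= 1
            # A run with no hashtag or mention (e.g. a bare "…") is not a dangly.
            if not any(is_tag(t) for t in tokens[start:end]):
                break
        tail.append(" ".join(tokens[start:end]))
        end = start
    body = [" ".join(tokens[:end])] if end > 0 else []
    return body + tail[::-1]

example = ("#Coffee Instagram by @kmarrero7 Iced coffee Starbucks newbie "
           "#forgotthemilk #dark #espresso #palpiations #ineedanek … "
           "http://t.co/pT42jAdnmQ")
for segment in split_danglies(example):
    print("[" + segment + "]")
```

On the example tweet this yields three segments: the body, the hashtag string ending in “…”, and the URL.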

NPs as free utterances

If it's not a list, NPs can be separate utterances.

Example:

#Coffee Instagram by @kmarrero7 Iced coffee Starbucks newbie #forgotthemilk #dark #espresso #palpiations #ineedanek … http://t.co/pT42jAdnmQ

[#Coffee Instagram by @kmarrero7] [Iced coffee Starbucks newbie] [#forgotthemilk #dark #espresso #palpiations #ineedanek …] [http://t.co/pT42jAdnmQ]

Constructed Dialog

Constructed dialog with explicit speakers should be thought of as an abbreviated form of “X said”; each speaker's turn is therefore one utterance.

Example:

Me: “Lexi‚ are you drinking black coffee?“ Lexi: “Yeah‚ like my soul“ http://t.co/VByA66xwsN

[Me: “Lexi‚ are you drinking black coffee?“] [Lexi: “Yeah‚ like my soul“] [http://t.co/VByA66xwsN]

Part of Speech Tagging

Dead Simple Tags

Modern Internet initialisms, such as “idk”, “brb”, and “lol”, are tagged ACR. This is intended to benefit CMC researchers, who may be interested in these explicitly despite their numerous possible syntactic roles. Additionally, some initialisms aren't constituents: “idk” takes a whole CP as an argument. It is better to keep it simple here.

“RT” is tagged IN.

URLs should be tagged URL.

Usernames (tokens beginning with @) are NP/NPS. They are typically names; if they aren't, they have been turned into one.

“…” is “:”, since it may not be sentence terminal.
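Because these rules depend only on the token itself, they can be sketched as a simple lookup. The function `simple_tag` and the set `ACRONYMS` are hypothetical names; the set holds only the three initialisms named above, and a real list would be longer. The sketch always returns NP for usernames, since choosing between NP and NPS needs a human decision.

```python
# Only the initialisms named in this guide; an assumption, not a complete list.
ACRONYMS = {"idk", "brb", "lol"}

def simple_tag(token):
    """Return the tag for a "dead simple" token, or None if no rule applies."""
    lower = token.lower()
    if lower in ACRONYMS:
        return "ACR"
    if token == "RT":
        return "IN"
    if lower.startswith(("http://", "https://")):
        return "URL"
    if token.startswith("@") and len(token) > 1:
        return "NP"   # NPS is also possible; left to the annotator
    if token == "…":
        return ":"
    return None

for tok in ["RT", "@kmarrero7", "lol", "…", "http://t.co/pT42jAdnmQ", "coffee"]:
    print(tok, simple_tag(tok))
```

Tokens that return `None` fall outside this section and need ordinary part-of-speech tagging.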

#Hashtags

# is Twitter's most infamous feature. First and foremost, hashtags retain their original part-of-speech tag, but “H” is also appended to that tag because hashtags are licensed to stand alone in some circumstances. Examples:

  • NNH: “#coffee”
  • NNSH: “#beers”
  • NPH: “#kimkardashian”
  • VVGH: “#winning”
  • JJH: “#hot”

For more complex hashtags, use the tag of the head word of the highest-level constituent.

  • NNH: “#BananaSplit”
  • NPH: “#HeladoGourmet”
  • NNSH: “#HatsOnForHarry”
  • NNH: “#Brownie&Cream”
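The tag construction itself is mechanical once the head word's tag is known, as the following sketch shows. Both function names are hypothetical. `split_hashtag` is only a convenience that breaks a CamelCase hashtag into words so an annotator can pick the head; it does not choose the head or its tag, and it will not segment all-lowercase hashtags such as “#kimkardashian”.

```python
import re

def split_hashtag(hashtag):
    """Break a hashtag into words at capital letters, digits, and symbols."""
    body = hashtag.lstrip("#")
    return re.findall(r"[A-Z][a-z]+|[a-z]+|[A-Z]+|\d+|[^\w\s]", body)

def hashtag_tag(token, head_pos):
    """Append "H" to the head word's tag if the token is a hashtag."""
    if token.startswith("#"):
        return head_pos + "H"
    return head_pos

print(split_hashtag("#HatsOnForHarry"))      # words for the annotator to inspect
print(hashtag_tag("#HatsOnForHarry", "NNS")) # head word "Hats" is NNS
print(hashtag_tag("#coffee", "NN"))
```

Passing the head tag in as an argument keeps the linguistic decision with the annotator and leaves only the suffixing to the code.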

Constituent Parsing

Dependency Parsing

twitter.txt · Last modified: 2015/03/23 15:03 by des62