This is a laundry list for now.
Despite Twitter being “so non-standard,” the PTB Guidelines are straightforward to apply in a lot of cases. Aside from the many cases that are particular to the Internet and Twitter, a lot of the language follows prescriptive language rules, which PTB covers quite well. Some examples of this:
It seems like people are using few relative clauses and adverbials. This is probably a product of the message length. For now, this is just a feeling that I'm not using many of these features. Once I'm done, a cross-corpus comparison should be run to verify this intuition.
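The cross-corpus comparison could start from something as simple as counting labels in the bracketed trees. The sketch below is only an illustration: the function name `label_rates`, the per-100-token normalization, and the choice of SBAR and ADVP as rough proxies are all assumptions, and SBAR also covers complement clauses, so it overcounts relative clauses.

```python
import re

# Matches the label after each opening bracket, e.g. "SBAR" or "ADVP-TMP".
LABEL = re.compile(r"\(([^\s()]+)")
# Matches a leaf such as "(RB already)": a tag followed by a single token.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")

def label_rates(trees, labels=("SBAR", "ADVP")):
    """Return occurrences of each label per 100 tokens across the trees."""
    counts = {label: 0 for label in labels}
    tokens = 0
    for tree in trees:
        # Count overt tokens only; leaves tagged NONE are null elements.
        tokens += sum(1 for tag, _ in LEAF.findall(tree) if tag != "NONE")
        for label in LABEL.findall(tree):
            base = label.split("-")[0]  # strip function tags such as -TMP
            if base in counts:
                counts[base] += 1
    if tokens == 0:
        return {}
    return {label: 100 * n / tokens for label, n in counts.items()}
```

Running the same function over the tweet trees and over a sample from another treebank would give two sets of rates to compare, provided both use the same bracketing conventions.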
Sentence 5 from the coffee data is an example of this:
@Ribbitttt * comes over a few minutes later, carrying your coffee and a cappuccino [...]
Often Twitter users will drop “I'm” or “I am” from a tweet, implying a first person subject.
becoming more of a tea person than a coffee person [...]
In analyzing this sentence (6), I'm making two overt decisions:
Another case where -USR is interesting is sentence (46):
(ROOT (S (S (NP-SBJ (NONE *)) (VP (VVZ Tastes) (ADJP (JJ great)))) (, ,) (CC but) (S (NP-SBJ-USR (NONE *-1)) (VP (VVP wish) (SBAR (WHNP 0) (S (NP-SBJ-1 (PRP I) ) (VP (VBD was) (VP (ADVP-TMP (RB already) ) (VVG moving) (PRT (RP on) ) (PP-DIR (TO to) (NP-TMP (NNPH #WineWednesday) )))))))) (SENT !) ))
Note that it's “wish I was moving on to #WineWednesday,” and that it's not *“wish was moving on to #WineWednesday.” Cool piece of data there. It seems -USR is licensed from a directly governing top level S. Also note the pro-drop on Tastes. But then, inside the relative clause, you can't drop the pronoun. Whoa.
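To check whether dropped subjects really only show up directly under an S, the trees can be searched mechanically. This is a minimal sketch, not part of the annotation tooling: `parse` and `null_subjects` are hypothetical helpers, and the example tree is a simplified version of sentence (46).

```python
import re

def parse(s):
    """Parse one bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()

def null_subjects(tree, parent=None):
    """Yield (subject_label, parent_label) for each subject NP that dominates only a NONE leaf."""
    label, children = tree
    if (label.startswith("NP-SBJ") and len(children) == 1
            and isinstance(children[0], tuple) and children[0][0] == "NONE"):
        yield (label, parent)
    for child in children:
        if isinstance(child, tuple):
            yield from null_subjects(child, label)

# Simplified from sentence (46); the adverb, particle and PP are left out.
TREE = (
    "(ROOT (S (S (NP-SBJ (NONE *)) (VP (VVZ Tastes) (ADJP (JJ great)))) "
    "(, ,) (CC but) "
    "(S (NP-SBJ-USR (NONE *-1)) (VP (VVP wish) (SBAR (WHNP 0) "
    "(S (NP-SBJ-1 (PRP I)) (VP (VBD was) (VVG moving)))))) "
    "(SENT !)))"
)

for subject, parent in null_subjects(parse(TREE)):
    print(subject, "under", parent)
```

Both null subjects in this tree sit directly under an S, while the embedded subject is overt, which is the pattern described above. Running this over the whole corpus would show whether any null subject is ever found elsewhere.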
Sentence (6) contains the first ACR, idk, which can take a sentential complement. I have no evidence that it's used with a nominal complement.
If in front, these are vocatives, as they are addressing the target user. If we see several of them, as in sentence (18), they are just Chomsky-adjoined. Otherwise, they are treated as their default constituent.
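Assuming “these” refers to @mentions, the “in front” test can be sketched as peeling the leading run of mentions off a tokenized tweet. The function name `split_leading_mentions` and the whitespace tokenization are assumptions made for illustration only.

```python
def split_leading_mentions(tokens):
    """Split a token list into (leading @mentions, remaining tokens)."""
    i = 0
    # A bare "@" (as in "@ Home") is not a mention, so require a name after it.
    while i < len(tokens) and tokens[i].startswith("@") and len(tokens[i]) > 1:
        i += 1
    return tokens[:i], tokens[i:]

vocatives, rest = split_leading_mentions("@Ribbitttt * comes over a few minutes later".split())
print(vocatives)  # the candidates for vocative treatment
print(rest)       # everything else gets its default analysis
```

Mentions that appear later in the tweet stay in `rest`, so they keep their default constituent.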
Sometimes people use “math” in natural language, like in sentence (30). Here, + is a CC and = is the main verb.
(ROOT (S (NP-SBJ (NP (NN School) (NN drop) (NN off) ) (CC +) (NP (CD 30) (NNS laps) ) (CC +) (NP (NN coffee) ) (CC +) (NP (NN headspace) )) (VP (VVP =) (NP (JJ pure) (NN bliss) (PP-LOC (IN @) (NP (NNP Home) ))))))
They're treated as their own utterances.
(ROOT (X (URL http://t.co/QzP037CeGQ) ))
(ROOT (X (URL https://t.c) (: …) ))
Here, the '…' is what was left of a broken URL.
Here's sentence (39), which is neat.
(ROOT (X (URL http://t.co/3oeEvIM5Kq) (PP (IN At) (NP (NNP Trolley) (NNP Caffe) ))) (: —) )
Streams of hashtags can appear outside of typical syntactic constraints. This phenomenon is called a hashtag stream. Sometimes, whole tweets can be hashtag streams. Typically, though, they follow a sentence at the end of a tweet.
Hashtag streams are treated as separate utterances, so in either the trailing or entire tweet case, they are treated uniformly.
They are technically FRAG, so these utterances branch as such under ROOT.
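Because trailing and whole-tweet streams get the same treatment, one pass can handle both: peel off the maximal run of hashtags at the end of the tweet and treat it as its own utterance. This is a rough sketch; `split_hashtag_stream` is a hypothetical name and the tokens are assumed to be already split.

```python
def split_hashtag_stream(tokens):
    """Return (body, stream), where stream is the maximal trailing run of hashtags."""
    i = len(tokens)
    while i > 0 and tokens[i - 1].startswith("#") and len(tokens[i - 1]) > 1:
        i -= 1
    return tokens[:i], tokens[i:]

body, stream = split_hashtag_stream("Tastes great #coffee #yum".split())
print(body)    # the sentence, parsed as usual
print(stream)  # the hashtag stream, a separate FRAG utterance under ROOT
```

A tweet made up entirely of hashtags comes back with an empty body. The heuristic is purely positional, so a hashtag that is a syntactic argument of the sentence and happens to be the last token, like #WineWednesday in sentence (46) without its final punctuation, would be peeled off wrongly and needs a manual check.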
Let's have a look at sentence (9), though:
#chaicofi #cafe #capucino #hot #kochi #EKM #insta #sugar #pic by shabassubair007
Determining what the constituents are can be much more difficult in hashtag streams than in their sentential counterparts. Whereas normal sentences offer syntactic clues to the constituent structure in most cases, hashtag streams lack typical sentential structure. They really are FRAGs in the finest sense of the word.
Many of the NPs present here stand alone, but some form constituents with their neighbors, “#hot #kochi” being the most obvious.
EKM seems to be the postal code or an abbreviation for “Ernakulam” in India.
More debatable is '#sugar #pic' as a constituent. If the included picture were focused on sugar, I would call this a '#sugar #pic'. However, there seem to be merely empty packets of sugar in the photo.
What about other cases? Given that a lot of the vocabulary here, such as #insta, is not known, there are judgement calls to make about whether these words form constituents with their neighbors. However, our judgements don't need to be entirely uninformed. We use Google; if “EKM insta” or “insta sugar” seem to be “things,” in the sense that people talk about some thing in the world that is “insta sugar” or “EKM insta,” then these should be attached as constituents.
This technique even puts “hot kochi” into doubt. kochi seems to be a place in India or Japan, so “hot kochi” only works as a constituent if they're identifying the city as particularly hot, which seems unusual to do while drinking a hot cappuccino. chaicofi refers to the coffee shop where the drink was purchased, which is located in kochi. So this guy isn't just emitting hashtags randomly. There's a lot more structure here than there initially appeared to be.
Given the nature of the stream, I'm going to make “#capucino #hot” a constituent, since I'm assuming that #hot describes the cappuccino.
Now that I know more, “#chaicofi #cafe” is definitely a constituent. This is a cafe; he's not merely stating that he has “coffee” in every language but English.
#insta? This is an abbreviation for Instagram, where the image was originally posted. It should be an NNPH, not a
The takeaway? Hashtag streams require further inspection for context to determine constituency, especially in cases like this, with lots of foreign or unusual words.
The associated image here is a banana split: [https://instagram.com/p/yacbr5D-a9/]
(ROOT (NP (NP (JJ Delicious) (NNH #BananaSplit) (NNH #Capucino) (NNH #Brownie&Cream) (NNH #Vainilla)) (NP-LOC (NNPH #HeladoGourmet) (NNP @GelartiPty) (NNPH #MetroMall) )))
All of the hashtags seem to describe the banana split, except for #HeladoGourmet, which describes where it's from, and whatever @GelartiPty is, which seems to exist only as a string in this tweet.
This got me rethinking my whole analysis of hashtags. What if # is a preposition?
(ROOT (FRAG (PP (IN via) (NP (NNP @VOXXINews) )) (NP (NNH #coffee)) (NP (SH #IHaveCoupons) ) (: …) ))
If we want to coordinate similar types, these seem to be coordinated quite nicely.
Alright, my scheme for doing hashtags has completely fallen apart.
(ROOT (S (VP (VVN Iced) (S (NP (NN coffee) ) (NP (NNP Starbucks) (NN newbie) )))))
(ROOT (S (VP (VVDH #forgotthemilk) (SBAR (S (NP (JJH #dark) (NNH #espresso) (NNH #palpiations) ) (VP (VVPH #ineedanek) )))) (: …) ))
So here, the hashtag stream criterion fails in its simplest form. However, I suppose this is a sensible exception.