Twitter Constituent Parsing

This is a laundry list for now.

General Notes

Despite Twitter language being “so non-standard,” the PTB Guidelines are straightforward to apply in a lot of cases. Aside from the many cases that are particular to the Internet and Twitter, a lot of the language follows prescriptive language rules, which PTB covers quite well. Some examples of this:

  • Sentence (22)
  • Sentence (24)

It seems like people are using few relative clauses and adverbials. This is probably a product of the message length. For now, it's simply my impression that I don't use a lot of these features. Once I'm done, a cross-corpus comparison ought to be done to verify this intuition.
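One way the cross-corpus comparison could be sketched is to count how often each constituent label occurs in the bracketed trees of each corpus and compare the shares. This is only an illustration: the function name `label_rates` and both toy corpora are invented here, and counting every bracketed node (preterminals included) as the denominator is an assumption, not something the notes prescribe.

```python
import re
from collections import Counter

# Matches the label that follows each opening bracket, e.g. "SBAR" in "(SBAR".
LABEL_RE = re.compile(r"\(([^\s()]+)")

def label_rates(bracketed_trees):
    """Return each base label's share of all bracketed nodes in a corpus."""
    counts = Counter()
    for tree in bracketed_trees:
        for label in LABEL_RE.findall(tree):
            # Strip function tags and indices: "ADVP-TMP" -> "ADVP".
            counts[label.split("-")[0] or label] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}

# Toy inputs, invented for illustration only (not real corpus data).
twitter = ["(ROOT (S (NP (PRP I)) (VP (VVP wish))))"]
newswire = ["(ROOT (S (NP (PRP I)) (VP (VVP know) (SBAR (S (NP (PRP it)) (VP (VVZ rains)))))))"]

for name, corpus in [("twitter", twitter), ("newswire", newswire)]:
    rates = label_rates(corpus)
    print(name, "SBAR share:", round(rates.get("SBAR", 0.0), 3))
```

Comparing the shares for labels such as SBAR and ADVP across corpora would give a first, rough check of the intuition; a proper comparison would also need to control for sentence length.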

Phenomena of Interest

Third-Person Self-referential Action Speech

Sentence 5 from the coffee data is an example of this:

@Ribbitttt * comes over a few minutes later, carrying your coffee and a cappuccino [...] 

Gerunds with Implicit First-Person Subjects

Often Twitter users will drop “I'm” or “I am” from a tweet, implying a first-person subject.

becoming more of a tea person than a coffee person [...]

In analyzing this sentence (6), I'm making two overt decisions:

  1. I'm adding a label for implicit references back to the user: -USR

Another case where -USR is interesting is sentence (46):

(ROOT
	(S
		(S
			(NP-SBJ (NONE *))
			(VP
				(VVZ Tastes)
				(ADJP (JJ great))))
		(, ,)
		(CC but)
		(S
			(NP-SBJ-USR (NONE *-1))
			(VP
				(VVP wish)
				(SBAR
					(WHNP 0)
					(S
						(NP-SBJ-1 (PRP I))
						(VP
							(VBD was)
							(VP
								(ADVP-TMP (RB already))
								(VVG moving)
								(PRT (RP on))
								(PP-DIR
									(TO to)
									(NP-TMP (NNPH #WineWednesday))))))))))
		(SENT !)))

Note that the tweet has wish I was moving on to #WineWednesday, and not *wish was moving on to #WineWednesday. Cool piece of data there. It seems -USR is licensed by a directly governing top-level S. There's also pro-drop on the subject of Tastes. But then, inside the embedded clause, you can't drop the pronoun. Whoa.
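To find every place where -USR has been used, the bracketed trees can be read into a nested structure and searched by function tag. The following is a minimal sketch using only the standard library; `parse_tree` and `find_tagged` are hypothetical helper names, and the example string is an abbreviated version of the sentence (46) tree, not the full annotation.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other characters.
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")

def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples; leaves are strings."""
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # consume "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # consume ")"
        return (label, children)

    return node()

def find_tagged(tree, tag):
    """Yield the labels of all nodes that carry the given function tag, e.g. "USR"."""
    label, children = tree
    if tag in label.split("-")[1:]:
        yield label
    for child in children:
        if isinstance(child, tuple):
            yield from find_tagged(child, tag)

# Abbreviated from the sentence (46) tree for illustration.
example = "(ROOT (S (NP-SBJ-USR (NONE *-1)) (VP (VVP wish) (SBAR (S (NP-SBJ-1 (PRP I)) (VP (VBD was)))))))"
print(list(find_tagged(parse_tree(example), "USR")))
```

The sketch assumes well-formed, balanced brackets; a malformed tree will raise an IndexError rather than being repaired.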

ACR in Context

Sentence (6) contains the first ACR: idk.

  1. idk can take a sentential complement. I have no evidence that it's used with a nominal complement.

@Address

If @-mentions come at the front of a tweet, they are vocatives, since they address the target user. If we see multiple ones, as in sentence (18), they are just Chomsky-adjoined. Otherwise, they are treated as their default constituent.
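The positional rule above can be sketched as a small pre-processing step that peels the leading @-mentions off a tokenized tweet. The function name and the usernames in the example are made up for illustration, and whitespace tokenization is an assumption.

```python
def split_leading_mentions(tokens):
    """Split a token list into (leading @-mentions, remaining tokens)."""
    i = 0
    # A bare "@" (as in "@ Home") is not a mention, so require a name after it.
    while i < len(tokens) and tokens[i].startswith("@") and len(tokens[i]) > 1:
        i += 1
    return tokens[:i], tokens[i:]

# Hypothetical tweet; the usernames are invented.
vocatives, rest = split_leading_mentions("@a @b thanks for the coffee @c".split())
print(vocatives)  # candidates for vocative treatment
print(rest)       # the mention inside the sentence keeps its default analysis
```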

Twitter + Math + Weird Language = Interesting Data?

Sometimes people use “math” in natural language, as in sentence (30). Here, + is a CC and = is the main verb.

(ROOT
	(S
		(NP-SBJ
			(NP (NN School) (NN drop) (NN off))
			(CC +)
			(NP (CD 30) (NNS laps))
			(CC +)
			(NP (NN coffee))
			(CC +)
			(NP (NN headspace)))
		(VP
			(VVP =)
			(NP
				(JJ pure)
				(NN bliss)
				(PP-LOC
					(IN @)
					(NP (NNP Home)))))))

Trailing URLs

They're treated as their own utterances.

(ROOT
	(X
		(URL http://t.co/QzP037CeGQ) ))
(ROOT
	(X
		(URL https://t.c) (: …) ))

The '…' here is where the URL was broken off.

Here's sentence (39), which is neat:

(ROOT
		(X
			(URL http://t.co/3oeEvIM5Kq) 
			(PP
				(IN At) 
				(NP (NNP Trolley)  (NNP Caffe) ))) (: —) )
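Treating trailing URLs as their own utterances suggests a simple split before parsing. Below is a rough sketch under the assumption that tweets are whitespace-tokenized; `split_trailing_urls` is a hypothetical name and the sentence text in the example is invented (only the URL comes from the notes). It handles only URLs at the very end of the tweet, so a case like sentence (39), where material follows the URL, would still need manual treatment.

```python
import re

# Deliberately loose: truncated URLs such as "https://t.c" should still match.
URL_RE = re.compile(r"https?://\S*")

def split_trailing_urls(tokens):
    """Split tokens into (body, trailing URLs), peeling URLs off the end."""
    end = len(tokens)
    while end > 0 and URL_RE.fullmatch(tokens[end - 1]):
        end -= 1
    return tokens[:end], tokens[end:]

body, urls = split_trailing_urls("Great coffee today http://t.co/QzP037CeGQ".split())
print(body)  # parsed as the main utterance
print(urls)  # each URL becomes its own (ROOT (X (URL ...))) utterance
```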

Hashtag Streams: #Hashtag #AnotherOne #YetAnother

Streams of hashtags can appear outside of typical syntactic constraints. This phenomenon is called a hashtag stream. Sometimes, whole tweets can be hashtag streams. Typically, though, they follow a sentence at the end of a tweet.

Hashtag streams are treated as separate utterances, so in either the trailing or entire tweet case, they are treated uniformly.

They are technically FRAG, so these utterances branch as such under ROOT.
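Because trailing streams and whole-tweet streams are handled the same way, one split can cover both. This is a minimal sketch assuming whitespace tokenization; the function name and the example tweet are invented. Note that it would also peel off a single tweet-final hashtag that is syntactically part of the sentence, so its output is a candidate split to be checked, not a decision.

```python
def split_hashtag_stream(tokens):
    """Split tokens into (sentence part, trailing run of hashtags)."""
    end = len(tokens)
    while end > 0 and tokens[end - 1].startswith("#") and len(tokens[end - 1]) > 1:
        end -= 1
    return tokens[:end], tokens[end:]

# Trailing stream after a sentence (invented example).
print(split_hashtag_stream("Tastes great #coffee #cafe".split()))
# Whole tweet is a stream: the sentence part comes back empty.
print(split_hashtag_stream("#hot #kochi".split()))
```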

Case Study 1: Sentence (9)

Let's have a look at sentence (9), though:

#chaicofi #cafe #capucino #hot #kochi #EKM #insta #sugar #pic by shabassubair007

[https://twitter.com/ChaiCofi/statuses/560501281270542336]

Determining what the constituents are can be much more difficult in hashtag streams than in their sentential counterparts. Whereas in normal sentences there are syntactic clues to the constituent structure in most cases, hashtag streams lack typical sentential structure. They really are FRAGs in the finest sense of the word.

Many of the NPs present here stand alone, but some form constituents with their neighbors–#hot #kochi being the most obvious. EKM seems to be the postal code or an abbreviation for “Ernakulam” in India.

More debatable is '#sugar #pic' as a constituent. If the included picture were focused on sugar, I would call this a #sugar #pic. However, there seem to be merely empty packets of sugar in the photo.

What about other cases? Given that a lot of the vocabulary here (#EKM, #kochi, #insta) is unfamiliar, there are judgement calls to make about whether these hashtags form constituents with their neighbors. However, our judgements don't need to be entirely uninformed. We use Google; if kochi EKM, EKM insta, or insta sugar seem to be “things,” in the sense that people talk about some thing in the world that is “insta sugar” or “EKM insta,” then these should be attached as constituents.
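The search phrases for this check can be generated mechanically from the stream, even though judging the results stays manual. A small sketch, with `candidate_pairs` as a made-up helper name; it only builds the phrases for neighboring hashtags and does not query any search engine.

```python
def candidate_pairs(hashtags):
    """Return a search phrase for each pair of neighboring hashtags."""
    words = [tag.lstrip("#") for tag in hashtags]
    return [f"{left} {right}" for left, right in zip(words, words[1:])]

# A slice of the stream from sentence (9).
stream = ["#kochi", "#EKM", "#insta", "#sugar"]
for phrase in candidate_pairs(stream):
    print(phrase)  # look each phrase up by hand and judge whether it is a "thing"
```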

This technique even puts hot kochi into doubt. kochi seems to be a place in India or Japan, so #hot #kochi is only a constituent if they're identifying the city as particularly hot, which seems unusual to do while drinking a hot cappuccino.

Investigating further, chaicofi refers to the coffee shop where the drink was purchased, which is located in kochi. So this guy isn't just emitting hashtags randomly. There's a lot more structure here than there initially appeared to be.

Given the nature of the stream, I'm going to make #capucino #hot a constituent, since I'm assuming that hot describes capucino.

Now that I know more, #chaicofi #cafe is definitely a constituent. This is a cafe; he's not merely stating that he has “coffee” in every language but English.

What about #insta? This is an abbreviation for Instagram, where the image was originally posted. It should be an NNPH, not a JJH.

The takeaway? Hashtag streams require further inspection for context to determine constituency, especially in cases like this, with lots of foreign or unusual words.

Case Study 2: Sentence (41)

The associated image here is a banana split: [https://instagram.com/p/yacbr5D-a9/]

(ROOT
	(NP
		(NP
			(JJ Delicious)
			(NNH #BananaSplit)
			(NNH #Capucino)
			(NNH #Brownie&Cream)
			(NNH #Vainilla))
		(NP-LOC (NNPH #HeladoGourmet) (NNP @GelartiPty) (NNPH #MetroMall))))

All of the hashtags seem to describe the banana split, except for #HeladoGourmet, which describes where it's from, and @GelartiPty, whatever that is, which seems to exist only as a string in this tweet.

Case Study 3: Sentence (50)

This got me rethinking my whole analysis of hashtags. What if # is a preposition?

(ROOT
	(FRAG
		(PP
			(IN via)
			(NP (NNP @VOXXINews)))
		(NP (NNH #coffee))
		(NP (SH #IHaveCoupons))
		(: …)))

If we want to coordinate similar types, these seem to be coordinated quite nicely.

Case Study 4: Sentences (56 - 57)

Alright, my scheme for doing hashtags has completely fallen apart.

(ROOT
	(S
		(VP
			(VVN Iced) 
			(S
				(NP (NN coffee) )
				(NP (NNP Starbucks)  (NN newbie) )))))
(ROOT
	(S
		(VP
			(VVDH #forgotthemilk) 
			(SBAR
				(S
					(NP (JJH #dark)  (NNH #espresso)  (NNH #palpiations) )
					(VP
						(VVPH #ineedanek) ))))
(: …) ))

So here, the hashtag stream criterion fails in its simplest form. However, I suppose this is a sensible exception.

twitter_constituent_parsing.txt · Last modified: 2018/09/11 10:02 (external edit)