Constituent parsing guidelines


–FRAG over complex INTJ??? 1.5 and 1.6 tick tock with one INTJ over it vs. two 1.12 oh damn

2.39 okay = yes vs. ADJP - changed to UH from JJ

also 2.63 and 2.60

In general, the syntax of chat is not too dissimilar to that analyzed in the original PTB guidelines. This section will address differences and problematic cases.

New POS tags:

The new POS tags created for this corpus, EMO and ACR, are both labeled INTJ. Utterances of only interjections should not receive a higher FRAG label. (An exception will be addressed later for questions.)

1.1 (ROOT

  (INTJ (ACR LoL) ))

2.81 (ROOT

      (NP-SBJ (PRP i) )
    (MD 'll) 
    (VV get) 
        (NP (PRP you) )
        (NP (DT a)  (RBR more)  (JJ current)  (NN one) ))) 
      (INTJ (EMO :-o) )))

Parentheses used in emoticons will need to be replaced with -LRB- and -RRB- (left and right round bracket). INTJ is attached at the highest S (or equivalent) node. Although CMC acronyms and emoticons might have some functionality equivalent to that of sentence-ending punctuation, INTJ is always the label used.
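The bracket replacement described above can be sketched in Python. This is a minimal illustration rather than part of the corpus tooling: the function names escape_emoticon_brackets and emoticon_leaf are hypothetical, and the sketch assumes the emoticon has already been tokenized as a single token.

```python
def escape_emoticon_brackets(token: str) -> str:
    """Replace round brackets with the -LRB- / -RRB- tokens so they
    cannot be confused with the brackets of the tree itself."""
    return token.replace("(", "-LRB-").replace(")", "-RRB-")


def emoticon_leaf(token: str) -> str:
    """Build an INTJ subtree over a single EMO token (hypothetical helper)."""
    return f"(INTJ (EMO {escape_emoticon_brackets(token)}))"


print(emoticon_leaf(":-o"))  # no brackets, so the token is unchanged
print(emoticon_leaf(":-)"))  # the closing bracket becomes -RRB-
```

Escaping at the token level keeps every remaining round bracket in the file structural, which makes a simple bracket-balance check on the finished trees meaningful.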

SYM, extended here for functional symbols (such as “-->” used to “point” at a user), should receive a function tag based on its use. The deictic use should receive the -DIR tag. While this is the only instance appearing in this corpus, other likely functions for other similar symbols in CMC include -LOC, -BNF, -EXT and -TMP (adverbial function labels).

2.95 (ROOT

(ADVP-TMP (RB Now) )
(NP-VOC (NNP 11-09-20sUser114))  
      (NP-SBJ (PRP you) )
              (MD must) 
                      (VV dance) 
                      (PP-CLR (IN with) 
		    (NP (SYM-DIR  -->) 
                              (NNP 11-09-20sUser43) )))) 
      (SENT !!) ))


Vocatives are a very frequent feature in this corpus, which is unsurprising given the number of participants and the multiple threads of conversation taking place simultaneously.

1.9 (ROOT

  (NP-VOC (NNP 11-09-20sUser101) )
(INTJ (UH yes))
  (NP-SBJ (PRP i))
	(VBP 'm) 
(ADVP-LOC-PRD (RB here) ))))

In some cases, the vocative may not be a unique user name, but rather an address to the whole room or a term of endearment:

2.1 (ROOT

      (NP-SBJ-1 (PRP I) )
    (MD 'd) 
	(VV like) 
                      (NP-SBJ *-1)
		    (TO to) 
			(VV chat))))))
(NP-VOC (NN cutie) )))

For other uses of vocatives, see the next section, greetings.


Greetings (-GREET) are a common feature of this corpus. They generally consist of at least one token labeled UH, and may or may not contain an NP-VOC. Under the normal rules, both INTJ (covering only the UH) and NP-VOC would be attached directly to S. However, making a new category allows us to easily distinguish between the utterances below:

Hi room

lol bubbles007

Additionally, given the frequent use of vocatives, a -GREET label allows us to see which vocatives are part of a greeting, and which ones are used in other contexts, such as directing a question to a specific user. The same logic applies to interjections: the label makes it possible to separate “lol” from “hello”.

When the greeting consists of an UH and an NP-VOC, -GREET is attached to FRAG. -GREET attaches to INTJ when the greeting is solely UH, or to NP-VOC when someone's name is being used as a greeting. A lone vocative used as a greeting may arise in a situation where a frequent participant in the chat room has been absent for a long time. Upon their return, if they are met with:


this should be labeled NP-VOC-GREET.
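The attachment rule for -GREET can be summarized as a small decision function. This is only a sketch of the three cases stated above; the function name greet_label and its boolean parameters are hypothetical, and it does not cover the greeting-plus-introduction case discussed later.

```python
def greet_label(has_uh: bool, has_voc: bool) -> str:
    """Return the node that carries -GREET for a greeting.

    has_uh:  the greeting contains at least one UH token
    has_voc: the greeting contains an NP-VOC
    """
    if has_uh and has_voc:
        # UH plus vocative: -GREET goes on the dominating FRAG
        return "FRAG-GREET"
    if has_uh:
        # greeting is solely UH: -GREET goes on INTJ
        return "INTJ-GREET"
    if has_voc:
        # a name used on its own as a greeting
        return "NP-VOC-GREET"
    raise ValueError("a greeting needs at least a UH or an NP-VOC")


print(greet_label(has_uh=True, has_voc=True))
print(greet_label(has_uh=False, has_voc=True))
```

Whether an utterance counts as a greeting at all remains an annotator judgment; the function only picks the attachment site once that decision has been made.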

-GREET may apply to the entirety of the utterance:


It may also be part of a longer unit:


In addition to the above type of greeting, there is a second kind occurring in this corpus: a greeting that includes an introduction.

2.50 (ROOT

      (GREET (INTJ (UH hi)) 			
      (NP (NN 31f)(NNP ga) )))
    (S (NP-SBJ-1 *)
	(VV im) 
	    (NP (PRP me) )
		(S (NP-SBJ *-1)
		         (TO to) 
			         (VV chat) ))))))

In this case hi 31f ga is labeled GREET. However, as the NP here is a reference to the user, and not a vocative, it does not receive the -VOC tag. As this is the only example of an introduction combined with a greeting in this corpus, more robust guidelines have not been developed. However, if this proves to be a more common phenomenon, it may be worth creating a new tag (such as -SELF) to cover user introductions.


As is common in CMC, users sometimes express that an “action” is taking place. However, this is not meant to describe a real action unfolding offline, but rather what a user would do if circumstances were a certain way, or is doing in the imaginary real-world space of the chat context. For example:

all i want to do is activate my new credit card and ive been on hold for 45 minutes :(

*bangs head on desk*

In this example, user one describes a real situation that has been a frustrating experience. She follows it up by stating the equivalent of “right now I am banging my head on the desk”; however, it is not assumed that she is actually doing this action.

In the chat corpus used here, actions are expressed without the normal notation of surrounding asterisks:


Some CMC intuition on the part of the annotator is required for labeling ACTION. ACTION should generally be identifiable by the lack of a personal pronoun (or overt subject) and the use of the simple present instead of the present progressive to describe events, states, etc. However, ACTION should not be used for all hypotheticals, even ones that surround actions. Thus a following sentence related to this pretend game of spin the bottle, “Now you must dance with --> 11-09-20sUser43!!”, does not receive the ACTION label.


As with spontaneous speech, chat often features less than ideally composed questions. This section will address SQ and a new category, FRAGQ.


In the PTB guidelines, SQ is used for “Inverted yes/no question, or main clause of a wh-question, following the wh-phrase in SBARQ”. It is also used for questions lacking both a subject and auxiliary. In the chat corpus, SQ is used for all these cases, plus questions that are only missing do-support. These questions are relatively common in the present corpus.

1.15 (ROOT

   (SQ (NP-SBJ-1 (NN anyone) )
     (VVP want) 
	(S (NP-SBJ *-1)
		(TO to) 
			(VV chat) 
			    (PP-CLR (IN with) 
			    (NP (PRP me) ))))))))

2.25 (ROOT

(NP-SBJ (NN somebody) )
    (VVP type)
            (VVG perving) ))

(SENT ?) ))


In some cases, the question lacks not just auxiliaries and subjects, but any VP. These are labeled FRAGQ. The sheer number of fragments in this corpus necessitates a separate category for fragment questions, as subsuming them all under one label would mean a loss of key information.

2.11 (ROOT

      (NP-SBJ (NN anybody) )
    (IN from) 
	(NP (NNP wisconsin) )) (SENT ?) ))

2.28 (ROOT

(NP-VOC (NNP 11-09-20sUser101) )
    (IN like) 
	(NP (WP what) ))))

2.45 (ROOT

(WHNP (WP What)				
          (NP (DT the) 				
          (NN hell) ))

(SENT ?) ))

FRAGQ is also used in limited circumstances with some sentences containing a VP. As seen below, the first is missing more than just the auxiliary and existential there (“is there”). This produces a sentence whose grammaticality is degraded as compared to the SQ examples above. This degradation makes FRAGQ the stronger choice.

2.53 (ROOT

(NP-SBJ (NN Anyone) )
	(SBARQ (WHNP-1 (WDT who) )
		(S (NP-SBJ *T*-1)
			    (VVP want) 
				(S (NP-SBJ *-1)
					(TO to) 
						(VV chat) ))))))

(SENT ?) ))

FRAGQ is also used in cases lacking the copula:

2.69 (ROOT

      (NP-VOC (NNP 11-09-20sUser105) )
	(WRB why) )
		(NP-SBJ (PRP u) )
		    (VVN pissed) 
                  (ADVP-PRP (*T*-1)))))))

It is important to note, however, that copula deletion is standard in some dialects, such as AAVE. There is no indication that the users of this chat room are speakers of such a dialect, as they employ standard copula usage except in some questions. Different guidelines should be adopted if the chat room is occupied primarily by people for whom copula dropping is standard.

FRAGQ also covers instances where the question is composed entirely of an interjection, plus an optional vocative. Only one example of this occurred in the present corpus.

2.13 (ROOT

  (FRAGQ (INTJ (UH huh)) 
      (SENT ?) ))

FRAGQ is thus a rather robust category that can contain rather complex structure or very minimal structure. Its labeling as a fragment may be due to the lack of a key element, degraded syntax, or something in between.
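The choice between SQ and FRAGQ described in this section can be sketched as a decision over a few annotator-supplied features. This is an illustrative simplification, not an automatic classifier: the class QuestionFeatures, its field names, and the function question_label are all hypothetical, and judging whether a question is "degraded" is still left to the annotator.

```python
from dataclasses import dataclass


@dataclass
class QuestionFeatures:
    """Annotator judgments about one question (hypothetical fields)."""
    has_vp: bool = True              # the question contains a VP
    only_interjection: bool = False  # interjection plus optional vocative only
    missing_copula: bool = False     # e.g. "why u pissed"
    degraded: bool = False           # missing more than an auxiliary


def question_label(q: QuestionFeatures) -> str:
    """Return FRAGQ for fragment questions, otherwise SQ."""
    if q.only_interjection or not q.has_vp:
        return "FRAGQ"
    if q.missing_copula or q.degraded:
        return "FRAGQ"
    # Inverted questions, questions lacking subject and auxiliary,
    # and questions missing only do-support all stay SQ.
    return "SQ"


print(question_label(QuestionFeatures()))              # missing do-support only
print(question_label(QuestionFeatures(has_vp=False)))  # no VP at all
```

Keeping the features as explicit booleans mirrors the guideline text: each FRAGQ condition corresponds to one case discussed above, so a disagreement between annotators can be traced to a single judgment.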



This corpus is littered with fragments. FRAG here covers slightly more ground than in the PTB guidelines, although its use here is a natural extension of the original intentions. A range of examples will be provided here to demonstrate the extent of FRAG.

The next two sentences are FRAG due to the lack of a VP:

1.13 (ROOT

(NP-VOC (NNP Officer)  (NNP 11-09-20sUser114) ) (: ...) 
    (IN on) 
	(NP (DT the)  (NN ball) )
		(IN as) 
			(JJ usual) )))))

2.54 (ROOT

  (FRAG (INTJ (UH ah) )
(NP (JJ good) (NN night) 
	(S (NP-SBJ *)
                  (VP (TO to) 
		(VP (VH have) 
		    (NP (DT the)  (NN top) )
			(ADVP-DIR (RB down) )))))))

Expressions consisting solely of foreign words, even when the syntax is correct for the language, receive FRAG. Different guidelines should be developed if the chat text consists of significant portions of two languages.

2.20 (ROOT

  (FRAG (FW no)  (FW comprende) ))

Following the PTB guidelines, answers to questions are FRAG:

2.21 (ROOT

(NP (NN one) )
	(IN in) 
	(NP (NNP appleton) ))))    

Utterances consisting solely of INTJ (including a VOC) do not take a dominating FRAG node (an exception applies to questions). However, more complex utterances centered around INTJ should receive the FRAG label, consistent with FRAG covering answers to questions. In this case, User34 had previously expressed that they were just here to look, and did not require anything else. User126, the other participant in this exchange, then acknowledges this with “ok”:

2.60 (ROOT

    (UH ah) )
	(, ,) 
	(ADJP (JJ ok) )
	(NP-VOC (NNP 11-09-20sUser34) )))

Problematic cases:

Some utterances in this corpus seem to defy easy understanding. Annotator intuition should be used here to find the best framework for that particular analysis. Problematic cases from the present corpus are discussed below.

Some of the problematic cases are ambiguous due to a lack of proper punctuation. If this is the case, the simpler analysis, the one that assumes no missing punctuation, should be chosen unless there is an overriding semantic reason to make the opposite choice. The thread of chat can be very difficult to follow, and the annotator should be warned that the meaning may not always be recoverable.

This sentence could be read as “See this person” or “See, vocative”. The former interpretation was chosen as this analysis is lower cost.

2.36 (ROOT

  	(NP-SBJ *)
    (VV Seee) 
    (NP (NNP 11-09-20sUser101) )) 
   (: ...) ))

However, some cases will present two options, each based on a correction. Although these guidelines have previously established that some cases will require ignoring spelling errors and interpreting a token in the given context, this should only be done when the desired syntax is clear. The thread of the conversation is difficult to follow at this point, but it does suggest that the analysis below is correct.

2.44 (ROOT

       (NP-SBJ (PRP I) )
	    (VVD did) 
		    (TO to) 
			(NP (NNP 11-09-20sUser59) )))))

Thus this should be read as “I did, to User59”, and not “I did too, User59”, which would be a perfectly viable option given the messy state and lax rules of chat.

Some utterances will continue to be opaque even after many readings.

2.74 (ROOT

    (VVD cut) 
	(NP (PRP you) )
	    (RP off) ))))

The lack of punctuation makes the meaning difficult to discern. As another user had previously asked “Did some fucker cut me off”, it seems that this user is echoing with “cut you off”. It could be a question, but without any clue that this is the case, the utterance receives FRAG.


Sometimes tags will collide in unexpected ways.

2.40 (ROOT

(NP-SBJ (PRP it) )
	(VBZ s) 
	    (ADVP-TMP (RB always) )
	(: .....)
	    (INTJ-PRD (`` ") 
	    (UH ewwwww) 		
      	    (" ") )))) 

The context for this sentence is a discussion among some of the participants about perving on people (creepily flirting). “ewwwww” here is certainly an interjection. Its position after “be” and its similarity to an adjective suggest the PRD label given here. Annotators following these guidelines should simply be ready to make seemingly quirky decisions if that is what application of these guidelines and the PTB guidelines leads to.

One last note involves the use of chat speak. In some cases this can be very opaque, and annotators should not spend too much time trying to interpret unknown elements, especially where there is not one clear interpretation, as in many cases it should not matter. For the curious, however, Urban Dictionary is a good resource for interpreting them.

One example in the present corpus is hb!, an acronym unknown to the author of these guidelines. This has multiple interpretations that would fit with the discussion taking place, such as “hot babe” or “hot bitch”. However, it also has the meaning “hurry back”. Given that the immediately preceding utterance was brb (“[I'll] be right back”), “hurry back” is the most logical choice. However, as in either case it would receive the ACR label, its exact meaning is irrelevant.

chat_constituent_parsing.txt · Last modified: 2018/09/11 10:02 (external edit)