Part of Speech Tagging

The Penn Treebank (PTB) tagging guidelines contain 36 part-of-speech tags. For ease of use and compatibility, these updated guidelines restrict themselves to the tagset used in the PTB guidelines, but some cases call for novel conventions. These cases are laid out below.

Novel Forms

The preceding section on word segmentation introduced words found in the corpus that are treated as single words, separate from the multi-word sequences from which they were originally derived. These words (gonna, wanna, gotta, kinda, and Imma) need to be tagged according to their role in the sentence. For instance, as discussed earlier, Imma contains both a subject and an auxiliary (and some would argue that subject, auxiliary, present participle, and infinitival ‘to’ are all fully present in this one form).

The PTB part-of-speech tagging guidelines do not differentiate between auxiliaries and main verbs, only between inflected and uninflected verbs. Auxiliaries therefore get the tag corresponding to their inflection (here meaning whether or not they carry third person singular –s).

EX:
I	mma	switch	places	with	you
PP	VVP	VV	NNS	IN	PP
We	wanna	know
PP	VVP	VV
I	gotta	go
PP	VVP	VV
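
Each example in these guidelines pairs a tab-separated line of tokens with a tab-separated line of tags. As a minimal sketch (the helper name `pair_tokens_with_tags` is hypothetical and not part of any annotation tool), the following Python checks that the two lines align before pairing them:

```python
def pair_tokens_with_tags(token_line, tag_line):
    """Pair each token with its tag; raise if the two lines do not align."""
    # Split on tabs, trim stray spaces, and drop empty cells left by doubled tabs.
    tokens = [cell.strip() for cell in token_line.split("\t") if cell.strip()]
    tags = [cell.strip() for cell in tag_line.split("\t") if cell.strip()]
    if len(tokens) != len(tags):
        raise ValueError(f"{len(tokens)} tokens but {len(tags)} tags")
    return list(zip(tokens, tags))


pairs = pair_tokens_with_tags("We\twanna\tknow", "PP\tVVP\tVV")
```

A check of this kind only catches column-count mismatches; it cannot tell whether a given tag is the right one for its token.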

The last example, gotta, alternates in this corpus with have gotta. When a string like this is encountered, the first verb in the sequence should be assumed to carry the inflection, with all following verbs tagged VV, VVG, or VVN, as with more standard sequences of auxiliaries (e.g., will have + main verb).

EX:
You	have	gotta	email	his	name
PP	VHP	VV	VV	PP$	NN
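
The first-verb rule can be sketched in Python as follows. This is an illustration under stated assumptions, not an official tool: the lookup table `FINITE_TAGS` and the function `tag_verb_sequence` are hypothetical names, the table lists only forms seen in the examples above, and the fallback to VVP for unlisted first verbs is an assumption. The sketch covers only sequences whose first verb is overtly present.

```python
# Hypothetical lookup of finite tags for first verbs seen in the examples above.
FINITE_TAGS = {"have": "VHP", "gotta": "VVP", "wanna": "VVP", "mma": "VVP"}


def tag_verb_sequence(verbs, nonfinite_overrides=None):
    """Tag the first verb as inflected and every later verb as non-finite.

    nonfinite_overrides maps a position in the sequence to "VVG" or "VVN" for
    participles; any later verb without an override defaults to "VV".
    """
    overrides = nonfinite_overrides or {}
    tags = []
    for position, verb in enumerate(verbs):
        if position == 0:
            # Assumption: first verbs missing from the table fall back to VVP.
            tags.append(FINITE_TAGS.get(verb.lower(), "VVP"))
        else:
            tags.append(overrides.get(position, "VV"))
    return tags


have_gotta = tag_verb_sequence(["have", "gotta", "email"])
plain_gotta = tag_verb_sequence(["gotta", "go"])
```

Here gotta receives VVP when it opens the sequence and VV when it follows have, matching the two examples.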

Contracted forms such as kinda are simpler to analyze. Rather than receiving the NN IN sequence of its source (kind of), kinda functions in the sentence as a pure adverbial modifier and is tagged RB.

EX:
Y’all	telling	stories	that	kinda	now	put	together	the	pieces
PP	VVG	NNS	IN/that	RB	RB	VVP	RB	DT	NNS

The most complicated of the five novel forms laid out above is gonna. In the LCDC, it sometimes appears with a preceding auxiliary (suggesting that it is not inflected), and sometimes the auxiliary is dropped. The auxiliary is dropped only in a particular context: when it would in Standard English be a present form of be other than the first person singular form. This should therefore be treated as dropping of the preceding be auxiliary, rather than as gonna sometimes carrying inflection and sometimes being bare. The earlier rule (inflection applies only to the first verb in a sequence of verbs, and all others are treated as uninflected, or as VVG or VVN) still holds in the case of gonna, even though it is hard to view gonna as a traditional infinitival form. The word gone, when used in a context where gonna could replace it, should be treated in the same manner.

EX:
I	m	gonna	take	you	to	see	this	show
PP	VBP	VV	VV	PP	TO	VV	DT	NN
We	gonna	give	you	our	personal	email	addresses
PP	VV	VV	PP	PP$	JJ	NN	NNS
Where	you	gone	meet
WRB	PP	VV	VV
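
As a rough Python sketch of the gonna rule: the tag is always VV, and the only thing that varies is whether the be auxiliary is overt. The names `BE_TOKENS` and `tag_gonna` are hypothetical, the set of be spellings is an assumption about how the corpus segments contractions, and the sketch looks only at the immediately preceding token. Deciding whether a given gone is replaceable by gonna is left to the annotator.

```python
# Assumed spellings of present-tense "be" tokens after segmentation (illustrative only).
BE_TOKENS = {"m", "am", "is", "are", "s", "re", "'m", "'s", "'re"}


def tag_gonna(tokens, index):
    """Return (tag, aux_dropped) for gonna, or gonna-like gone, at tokens[index]."""
    previous = tokens[index - 1].lower() if index > 0 else ""
    aux_dropped = previous not in BE_TOKENS
    # The tag is VV whether or not the "be" auxiliary is overt.
    return "VV", aux_dropped


with_aux = tag_gonna(["I", "m", "gonna", "take"], 2)
without_aux = tag_gonna(["We", "gonna", "give"], 1)
```

Returning the `aux_dropped` flag alongside the tag keeps the dropped auxiliary visible for later constituency annotation without changing the tag itself.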

Bare verbs

Auxiliary dropping in certain contexts, a feature common in African American English, was introduced in the previous subsection; the following section on constituency annotation discusses it in far more depth. Two phenomena in African American English both result in bare-seeming verbs: (1) variable auxiliary dropping, and (2) –s dropping in 3rd person singular present verbs.

3rd person singular present forms without –s should be tagged VVP, for consistency with the other forms in the present tense paradigm. The VVZ tag is used only when the standard third person singular present –s marking is present, which was most likely its original function as distinct from the VVP tag. Forms without –s are rare in the transcription analyzed, although common in general corpora of African American English.

EX:
what	he	look	like
WP	PP	VVP	IN
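
The choice between VVP and VVZ depends only on whether overt –s marking is present, not on the person of the subject. A minimal Python sketch, assuming the annotator already knows the verb is a present-tense lexical verb and supplies its lemma (the function name `present_tense_tag` and the simple spelling rules are illustrative assumptions):

```python
def present_tense_tag(surface, lemma):
    """Return VVZ only when the surface form shows overt -s marking, else VVP."""
    surface, lemma = surface.lower(), lemma.lower()
    # Regular spellings of the third person singular ending.
    marked_forms = {lemma + "s", lemma + "es"}
    if lemma.endswith("y"):
        marked_forms.add(lemma[:-1] + "ies")
    return "VVZ" if surface in marked_forms else "VVP"


unmarked = present_tense_tag("look", "look")
marked = present_tense_tag("looks", "look")
```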

When an auxiliary would appear in Standard English but does not appear in the text, the following verb should be tagged as a bare verb, that is, with the non-finite tag matching its form (VVG in the example below).

EX:
y’all	telling	stories
PP	VVG	NNS

Miscellaneous

  • Imperatives, even those which seem to have a discursive function, should be treated as verbs.
EX:
Wait	,	what	?
VVP	,	WP	SENT
  • Each word of a multi-word vocative should get the NN tag.
EX:
When	you	going	on	your	date	,	boo	boo	?
WRB	PP	VVG	IN	PP$	NN	,	NN	NN	SENT
  • Speech errors that are corrected immediately afterwards should be tagged UH. A speech error that remains uncorrected, or a token left incomplete because another speaker interrupts, should be tagged with a best guess from the context, with UH used only as a last resort.
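
The speech-error guideline amounts to a short decision procedure, sketched here in Python. The function name `tag_speech_error` and its parameters are hypothetical; judging whether an error was corrected, and what the best guess is, remains the annotator's task.

```python
def tag_speech_error(corrected_immediately, best_guess=None):
    """Choose a tag for a speech error or an incomplete token."""
    if corrected_immediately:
        # Immediately corrected errors are always UH.
        return "UH"
    # Otherwise prefer the annotator's contextual guess; UH is the last resort.
    return best_guess if best_guess else "UH"


corrected = tag_speech_error(True, best_guess="NN")
uncorrected = tag_speech_error(False, best_guess="NN")
no_guess = tag_speech_error(False)
```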
tagging_of_parts_of_speech.txt · Last modified: 2018/09/11 10:02 (external edit)