This is an old revision of the document!

Problem cases encountered in segmentation and part-of-speech tagging:


Imma

      Ex:    Imma switch places with you, Mo.
  Segment into 3 tokens: 'I' PP, 'mm' VBP, 'a' ASP (an aspectual auxiliary? The ASP tag is provisional and could change). Perhaps it should just be an AUX category.

Gonna and gone (when can be substituted with gonna)

      Ex:    Where you gone meet?
  These should be segmented so that the nV (n plus vowel) is a separate word, yielding 'na' or 'ne', which will be tagged as TO since it is phonologically close enough. The leftover 'go(n)' will be tagged as bare VV, in keeping with the other verbs that occur with a null copula. This bare VV may later be changed to indicate null copula or bare affix, perhaps as VV_nc and VV_ba.
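The split described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the tool used for the corpus: the function name `segment_gonna` and the `substitutable` flag are hypothetical, and whether 'gone' can be replaced by 'gonna' is assumed to be an annotator judgment passed in by hand.

```python
def segment_gonna(token, substitutable=True):
    """Split 'gonna', or 'gone' where it can be replaced by 'gonna'.

    Returns a list of (segment, tag) pairs. The final nV ('na' or 'ne')
    is tagged TO and the leftover 'go(n)' is tagged as bare VV.
    `substitutable` is a hypothetical flag set by the annotator; an
    ordinary participle 'gone' ("he has gone") is left whole and untagged.
    """
    key = token.lower()
    if key == "gonna" or (key == "gone" and substitutable):
        # Both forms end in a two-character nV sequence.
        return [(token[:-2], "VV"), (token[-2:], "TO")]
    return [(token, None)]


print(segment_gonna("gone"))   # [('go', 'VV'), ('ne', 'TO')]
print(segment_gonna("gonna"))  # [('gon', 'VV'), ('na', 'TO')]
```

Slicing from the end keeps the original capitalization of the token, so an utterance-initial "Gonna" comes out as "Gon" + "na".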


Imperatives

      Ex: Wait, what?
  This imperative "wait" has a discourse function in the text and appears utterance-initially, so it's tempting to just label it UH and not deal with its meaning, but it is a verb of some sort. In the GUM corpus these are labeled as bare verbs, so I am keeping the tag VV for imperatives for now, though this may change and become more specific.

Modal auxiliaries

      Running list: 'll (future)
  Tag as MD

Bare verbs

      Ex: I'm going home tonight and have some conversations
  In my English this would be "I'm going home tonight and having some conversations", so to me this verb is a bare gerundive 'have'. In this text it is tagged VH, but it may later be specified for the type of bare form it is.


Vocatives

      Ex: When you going on your date, boo boo?
  Vocatives will be tagged as NN. A multiword vocative such as the one above is thus tagged "NN NN".

Wanna, gotta

      Ex:    We wanna know.
             You have gotta email his name
             I gotta go
  Segmentation of wanna and gotta should parallel gonna above. 'na' will be split off and tagged as TO; 'wan' will be tagged as a variant of 'want', therefore VVP. 'gotta' in the second example is a special case: the full auxiliary 'have' is realized, so the 'got' that remains once 'ta' is segmented off is tagged VVN as a participle. 'gotta' in the third example is not accompanied by the 'have' auxiliary, so it is similar to 'need to' as described below. It seems like a modal but isn't, because it requires TO preceding an infinitive, in my English it also requires inflection, as in "he's gotta go", and it can't have contracted n't affixed to it. So 'I gotta go' should be PP VV TO VV for now and should probably be specified later. It may also turn out to be PP VVP; this is undecided.
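As a rough sketch of the tagging decision above, assuming the annotator already knows whether a 'have' auxiliary is realized in the clause: the function name `segment_wanna_gotta` and the `has_have_aux` flag are made up for illustration, and the VV tag for 'got' without 'have' is only the provisional choice described in these notes.

```python
def segment_wanna_gotta(token, has_have_aux=False):
    """Split 'wanna' and 'gotta' into a verb part and a TO part.

    Returns a list of (segment, tag) pairs.
    `has_have_aux` is a hypothetical flag supplied by the annotator:
    True for "You have gotta email his name", False for "I gotta go".
    """
    key = token.lower()
    # In both forms the last two characters ('na' / 'ta') are split off as TO.
    stem, to_part = token[:-2], token[-2:]
    if key == "wanna":
        # 'wan' is treated as a variant of 'want'.
        return [(stem, "VVP"), (to_part, "TO")]
    if key == "gotta":
        # With 'have' realized, 'got' is a participle (VVN).
        # Without it, 'got' is VV for now; VVP is an open alternative.
        return [(stem, "VVN" if has_have_aux else "VV"), (to_part, "TO")]
    return [(token, None)]


print(segment_wanna_gotta("gotta", has_have_aux=True))  # [('got', 'VVN'), ('ta', 'TO')]
print(segment_wanna_gotta("gotta"))                     # [('got', 'VV'), ('ta', 'TO')]
```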

Bare 3rd person singular verbs

      Ex:    You have gotta email his name, what he look like, snap a picture of him, cause you going out.
   'look' here would carry the 3rd sg affix -s in Standard English, so I originally tagged it VVZ. For the sake of consistency with the other bare verbs that I am unsure how to handle, I've changed its tag to VV and will probably specify bare person feature marking later.

Need to: tag it as a normal verb + TO, not a modal

      Ex:    in case we need to call you
   Not a modal because it can be inflected and it requires TO preceding an infinitive.

'mine' is labeled as PP$ for now, but I'm not committed to it.

Speech errors / interruptions

  A speech error that gets corrected should be tagged as 'UH'. A speech error that remains uncorrected should be tagged with a best guess at what it is, inferred from context, with the 'UH' tag used as a last resort.
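The speech-error rule is a small decision procedure, so it may help to see it written out. This is a minimal sketch under the assumption that both inputs come from the annotator; `tag_speech_error`, `corrected`, and `best_guess` are hypothetical names, not part of any existing tool.

```python
def tag_speech_error(corrected, best_guess=None):
    """Pick a tag for a token that is a speech error.

    corrected:  True if the speaker goes on to correct the error.
    best_guess: tag inferred from context for an uncorrected error,
                or None if no reasonable guess can be made.
    """
    if corrected:
        return "UH"
    # Uncorrected: prefer the contextual guess, fall back to UH as a last resort.
    return best_guess if best_guess is not None else "UH"


print(tag_speech_error(corrected=False, best_guess="NN"))  # NN
```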
tokenization_-_segmentation_into_words.1424280242.txt.gz · Last modified: 2021/02/11 17:01 (external edit)