tokenization_-_segmentation_into_words


Problem cases encountered in segmentation and part of speech tagging:

Imma Ex: Imma switch places with you, Mo. Segment into 3 tokens: 'I' tagged PP, 'mm' tagged VBP, and 'a' tagged ASP. The 'a' looks like an aspectual auxiliary; ASP is the current tag but could change, perhaps to a general AUX category.
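The three-way split can be sketched in Python as below. This is only an illustration of the rule as stated in this note: the function name `split_imma` is hypothetical, and ASP is the provisional tag, which may be replaced later.

```python
def split_imma(token):
    """Split 'Imma' into three (form, tag) pairs; return None for other tokens.

    Hypothetical helper. The tags PP, VBP and ASP come from the note above;
    ASP is provisional.
    """
    if token.lower() != "imma":
        return None
    # Slice the original token so its capitalization is preserved.
    forms = (token[0], token[1:3], token[3])
    return list(zip(forms, ("PP", "VBP", "ASP")))


print(split_imma("Imma"))
```

Slicing the original token, instead of returning fixed strings, keeps the token text identical to what appears in the transcript.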

Gonna and gone (when it can be substituted with gonna) Ex: Where you gone meet? Segment so that the final n + vowel sequence is a separate word, yielding 'na' or 'ne', which is tagged TO because it is phonologically close enough to 'to'. The leftover 'go(n)' is tagged as bare VV, in keeping with the other verbs where there is a null copula. This bare VV may later be changed to indicate a null copula or a bare affix, e.g. VV_nc and VV_ba.
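A minimal Python sketch of this split follows. The function name `split_gonna` and the flag `gone_is_gonna` are hypothetical; whether 'gone' can be substituted with 'gonna' is a judgment the annotator has to supply, so the code does not try to infer it.

```python
def split_gonna(token, gone_is_gonna=False):
    """Segment 'gonna', or 'gone' when it stands in for 'gonna'.

    Hypothetical helper. gone_is_gonna is set by the annotator and is not
    inferred from context. Tokens that do not match are returned whole
    with tag None.
    """
    low = token.lower()
    if low == "gonna":
        # 'gon' + 'na'
        return [(token[:3], "VV"), (token[3:], "TO")]
    if low == "gone" and gone_is_gonna:
        # 'go' + 'ne'
        return [(token[:2], "VV"), (token[2:], "TO")]
    return [(token, None)]


print(split_gonna("gone", gone_is_gonna=True))
```

A participial 'gone' (as in "he has gone") is left untouched because the flag defaults to False.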

Imperatives Ex: Wait, what? This imperative “wait” has a discourse function in the text and appears utterance-initially, so it is tempting to label it UH and not deal with its meaning, but it is still a verb of some sort. In the GUM corpus these are labeled as bare verbs, so I am keeping the tag VV for imperatives for now; this may change and become more specific.

Modal auxiliaries Running list: 'll (future). Tag as MD.

Bare verbs Ex: I'm going home tonight and have some conversations. In my English this would be “I'm going home tonight and having some conversations”, so to me this verb is a bare gerundive 'have'. In this text it is tagged VH, but the tag may later be specified for the type of bare form it is.

Vocatives Ex: When you going on your date, boo boo? Vocatives are tagged as NN. A multiword vocative such as the one above is thus tagged “NN NN”.

Wanna, gotta Ex: (1) We wanna know. (2) You have gotta email his name. (3) I gotta go. Segmentation of wanna and gotta should parallel gonna above. 'na' (or 'ta') is split off and tagged as TO; 'wan' is tagged as a variant of 'want', therefore VVP. 'gotta' in example (2) is a special case: the full auxiliary 'have' is realized, so once 'ta' is segmented off, 'got' is tagged VVN as a participle. 'gotta' in example (3) is not accompanied by the 'have' auxiliary, so it is similar to 'need to' as described below: it seems like a modal but isn't, because it requires TO preceding an infinitive, in my English it also requires inflection, as in “he's gotta go”, and it can't have contracted n't affixed to it. So 'I gotta go' should be PP VV TO VV for now; this should probably be specified later. Possibly PP VVP instead?
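The context dependence of 'gotta' can be sketched in Python as follows. Everything named here is hypothetical: the function `split_wanna_gotta`, the `prev_token` argument, and in particular the `HAVE_FORMS` set, which is an assumption about which preceding forms count as a realized 'have' auxiliary. The VV tag for 'got' without 'have' is the provisional choice from this note.

```python
# Assumption: these forms count as a realized 'have' auxiliary.
HAVE_FORMS = {"have", "has", "had"}


def split_wanna_gotta(token, prev_token=None):
    """Segment 'wanna' and 'gotta' into (form, tag) pairs.

    Hypothetical helper. 'got' is tagged VVN when the previous token is a
    form of 'have', otherwise VV (provisional). Other tokens are returned
    whole with tag None.
    """
    low = token.lower()
    if low == "wanna":
        # 'wan' is treated as a variant of 'want'.
        return [(token[:3], "VVP"), (token[3:], "TO")]
    if low == "gotta":
        has_aux = prev_token is not None and prev_token.lower() in HAVE_FORMS
        return [(token[:3], "VVN" if has_aux else "VV"), (token[3:], "TO")]
    return [(token, None)]


print(split_wanna_gotta("gotta", prev_token="have"))
print(split_wanna_gotta("gotta", prev_token="I"))
```

Looking only at the immediately preceding token is a simplification; an intervening adverb would need a wider window or a manual check.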

Bare 3rd person singular verbs Ex: You have gotta email his name, what he look like, snap a picture of him, cause you going out. 'look' here would carry the 3rd sg affix -s in Standard English, so I originally tagged it VVZ. For consistency with the other bare verbs whose treatment is still undecided, I've changed the tag to VV and will probably specify bare person feature marking later.

Need to: tag it as a normal verb + TO, not a modal. Ex: in case we need to call you. It is not a modal because it can be inflected and it requires TO preceding an infinitive.

'mine' is labeled PP$ for now, but I'm not committed to it.

Speech errors / interruptions A speech error that gets corrected should be tagged 'UH'. A speech error that remains uncorrected should be tagged with a best guess inferred from context, with the 'UH' tag used only as a last resort.

tokenization_-_segmentation_into_words.1428424486.txt.gz · Last modified: 2021/02/11 17:01 (external edit)