tokenization_-_segmentation_into_words

Differences

This shows you the differences between two versions of the page.
tokenization_-_segmentation_into_words [2015/04/07 16:34]
sm2842
tokenization_-_segmentation_into_words [2021/02/11 16:44] (current)
Line 1: Line 1:
-Problem cases encountered in segmentation and part of speech tagging: +=====Word segmentation=====
 +Word tokenization is not straightforward, and all other levels of analysis depend on how a given corpus is segmented into words. The word tokenization done on LCDC is very similar to that done on the Penn Treebank Corpus and most other corpora. White space separates tokens.
  
-Imma +All non-letter, non-numeric characters should be segmented as their own tokens apart from the surrounding tokens, with the following exceptions:
-Ex:    Imma switch places with you, Mo. +
-Segment into 3 tokens-- I PP, mm VBP, a (aspectual auxiliary? tagged now as ASP but could change). Perhaps should just be AUX category. +
-   +
-Gonna and gone (when can be substituted with gonna) +
-Ex:    Where you gone meet? +
-Should be segmented so that the nV is a separate word, yielding 'na' or 'ne', which will be tagged as TO as it is phonologically close enough. The 'go(n)' that is left over will be tagged as bare VV in keeping with the other verbs where there is a null copula. This bare VV may be changed later on to indicate null copula or bare affix, maybe like VV_nc and VV_ba. +
  
-Imperatives +  * __Possessives__: Singular possessive //'s// is split from the preceding word with the apostrophe intact; plural //'// is split from the preceding word as well.
-Ex: Wait, what? +
-This imperative "wait" has a discursive function in the text and appears utterance initially, so it's tempting to just label it UH and not deal with its meaning, but it is honestly a verb of some sort. In the GUM corpus these are labeled as bare verbs, so I am keeping this tag VV for imperatives for now, but this may change and become more specified. +
-     +
-Modal auxiliaries +
-Running list: 'll (future) +
-Tag as MD+
  
-Bare verbs +  * __Compounds__: When compounds are formed with a dash in the text, the dash is kept in place and the compound word is kept together.
-Ex: I'm going home tonight and have some conversations +
-In my English this would be "I'm going home tonight and having some conversations", so this verb to me is a bare gerundive Have. In this text it's tagged like VH, but may be specified later for the type of bare it is. +
  
-Vocatives +  * __Contractions__: The contracted form is split from the preceding word with the apostrophe intact. An exception to this is the second person plural form //y’all//, which is kept intact.
-Ex: When you going on your date, boo boo? +
-Vocatives will be tagged as NN. A multiword vocative such as the one above is thus tagged "NN NN". +
  
-Wanna, gotta +  * __Numbers__: When punctuation is used in a cardinal number, the punctuation is kept in place and the number is kept as a single token.
-Ex:    We wanna know. +
-You have gotta email his name +
-I gotta go +
-Segmentation of wanna and gotta should parallel gonna above. 'na' will be split off and tagged as TO, 'wan' will be tagged as a variant of 'want', therefore VVP. 'gotta' in the second example is a special case, as full auxiliary 'have' is realized, so that the 'got' here, when 'ta' is segmented apart from it, is tagged VVN as a participle. 'gotta' in the 3rd example is not accompanied by 'have' aux, so it is similar to 'need to' as described below, which seems like a modal but isn't because it requires TO preceding an infinitive and, in my English, also requires inflection such as "he's gotta go", and also can't have contracted 'nt affixed to it. So 'I gotta go' should be PP VV TO VV for now, and should probably be specified later. May also be PP VVP?? +
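The four exceptions in the current revision (possessives, compounds, contractions, numbers) can be illustrated with a small regular-expression tokenizer. This is only a minimal sketch, not the tool used to build the corpus: the names `TOKEN_RE` and `tokenize` are hypothetical, letters are assumed to be ASCII, and negative contractions such as //n't// are not given a rule of their own because the guidelines above do not say how to treat them.

```python
import re

# Alternatives are tried left to right, so the exceptions come before the
# catch-all rule that makes every other character its own token.
# Assumption: ASCII letters only; TOKEN_RE and tokenize are illustrative names.
TOKEN_RE = re.compile(
    r"""
    \b[Yy]['’]all\b                      # exception: y'all is kept intact
  | \d+(?:[.,]\d+)*(?![A-Za-z0-9-])      # numbers keep internal punctuation (1,000)
  | [A-Za-z0-9]+(?:-[A-Za-z0-9]+)*       # words; dashed compounds stay together
  | (?<=[A-Za-z0-9])['’][A-Za-z]+        # 's, 'm, 'll ... split off, apostrophe intact
  | \S                                   # any other character is its own token
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens; white space only separates tokens."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("Mo's well-known friends' 1,000 dogs, y'all."))
```

A plural possessive apostrophe falls through to the catch-all rule, because it is not followed by a letter, so it comes out as a token of its own.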
  
-Bare 3rd person singular verbs +One notable divergence from many non-spoken corpora is that the text that makes up the corpus is a transcribed representation of an audio file. The transcription uses conventionalized spellings for tokens that stem historically from multiple-word expressions and that have undergone phonological reduction followed by reanalysis as a single unit. The following such tokens appear fairly regularly in the corpus:
-Ex:    You have gotta email his name, what he look like, snap picture of him, cause you going out. +
-'look' here should carry the 3rd sg affix -s in Standard English, so I originally tagged it VVZ, but for the sake of consistency with the other bare verbs that I am unsure what to do with, I've changed the tag on it to VV and will probably specify bare person feature marking later. +
-      +
-Need to-- tag it as a normal verb + TO, not a modal +
-Ex   in case we need to call you +
-Not a modal because it can be inflected and it requires TO preceding an infinitive.+
  
-'mine' labeled as PP$ but I'm not committed to it. +//gotta//, //kinda//, //gonna//, //wanna//, etc.
  
-Speech errors / interruptions +These should all be treated as single tokens. This revises earlier coding guidelines, which decomposed each of them into two tokens based on the uncontracted forms from which they derive, either historically or synchronically (for the examples above: //got to//, //kind of//, //going to//, //want to//, respectively).
-A speech error that gets corrected should be tagged as 'UH'; a speech error that remains uncorrected should be tagged with a best guess at what it is, inferred from context, with the 'UH' tag being used as a last resort. +
-    +
  
-         +When these contracted forms show up as even more reduced items (e.g., //Where you **gone** meet?//), they should be treated in the same manner as above. To test for phonological reduction in a pair such as //gonna/gone//, mentally substitute //gonna// wherever //gone// is encountered in the text and check that the two actually carry the same function. 
-        + 
 +The one exception to this guideline is //Imma//, a contracted form stemming from //I’m going to//. Since this form contains the subject pronoun in addition to verbal material, the subject //I// must be split off and tokenized separately, while, for consistency with the other contracted auxiliaries mentioned above, //mma// is kept together. An example is given below: 
 + 
 +//**Imma** switch places with you, Mo.// → I mma switch places with you , Mo .
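The treatment of reduced forms can be sketched in the same spirit. The function name `tokenize_reduced` and the rough pre-tokenization pattern are assumptions made for illustration; the only behaviour taken from the guideline is that //gotta//, //kinda//, //gonna//, //wanna// and further reduced variants such as //gone// stay whole, while //Imma// is split into //I// and //mma//.

```python
import re


def tokenize_reduced(text):
    """Split text into tokens, applying the Imma exception.

    Hypothetical helper: the pre-tokenization below (runs of word
    characters, otherwise single non-space characters) is a stand-in
    for the full segmentation rules.
    """
    rough = re.findall(r"\w+|\S", text)
    tokens = []
    for tok in rough:
        if tok.lower() == "imma":
            # The subject pronoun is split off; 'mma' is kept together.
            tokens.extend([tok[0], tok[1:]])
        else:
            # gotta, kinda, gonna, wanna, gone, ... stay single tokens.
            tokens.append(tok)
    return tokens


if __name__ == "__main__":
    print(" ".join(tokenize_reduced("Imma switch places with you, Mo.")))
```

Deciding whether a given //gone// is the reduced auxiliary or the ordinary participle still needs the substitution test described above. The sketch leaves the token whole in either case, so that decision matters only at the tagging stage.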
  
tokenization_-_segmentation_into_words.1428424486.txt.gz · Last modified: 2021/02/11 17:01 (external edit)