=====Word segmentation=====
Word tokenization is not straightforward, and all other levels of analysis are dependent on how a given corpus is segmented into words. The word tokenization done on LCDC is very similar to the word tokenization done on the Penn Treebank Corpus and most other corpora. White space separates tokens.
  
One notable divergence from many non-spoken corpora is that the text that makes up the corpus is a transcribed representation of an audio file. Tokens that stem historically from multiple-word expressions, having undergone phonological reduction followed by reanalysis as a single unit, have conventionalized spellings; the following appear fairly regularly in the corpus:
  
//gotta//, //kinda//, //gonna//, //wanna//, etc.
  
These should all be treated as a single token. This is a revision from earlier coding guidelines, which decomposed them into two tokens based on the uncontracted forms from which they were derived either historically or synchronically (for the examples above: //got to//, //kind of//, //going to//, //want to//, respectively).
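The white-space rule above, including the single-token treatment of reduced forms, can be sketched in a few lines of Python. The `tokenize` helper is a hypothetical illustration, not part of any LCDC tooling:

```python
def tokenize(utterance: str) -> list[str]:
    """Split a transcribed utterance on white space.

    Reduced forms such as 'gotta' are written as single words in the
    transcript, so a white-space split keeps each one as a single token;
    they are not decomposed into their historical sources ('got' + 'to',
    'kind' + 'of', etc.).
    """
    return utterance.split()

print(tokenize("you gotta see it"))  # ['you', 'gotta', 'see', 'it']
print(tokenize("I was kinda gonna wait"))  # ['I', 'was', 'kinda', 'gonna', 'wait']
```

Because the transcription conventions already spell these forms as single orthographic words, no special-case lexicon is needed at tokenization time; the choice is made upstream, in the transcript itself.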
tokenization_-_segmentation_into_words · Last modified: 2021/02/11 17:01 (external edit)