=====Word segmentation=====
Word tokenization is not straightforward, and all other levels of analysis depend on how a given corpus is segmented into words. The word tokenization used for LCDC closely follows that of the Penn Treebank Corpus and most other corpora: white space separates tokens.
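The whitespace-based segmentation described above can be sketched as follows. This is a minimal illustration only, not the actual LCDC or Penn Treebank tokenizer; the function names and the rough Treebank-style approximation (which additionally separates punctuation from adjacent words) are hypothetical.

```python
import re

def whitespace_tokenize(text):
    """Split on runs of white space, as described above.
    Note that punctuation stays attached to neighboring words."""
    return text.split()

def treebank_like_tokenize(text):
    """A rough, illustrative approximation of Treebank-style
    tokenization: words and punctuation become separate tokens.
    (Hypothetical helper, not the actual LCDC tokenizer.)"""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Word tokenization is not straightforward."
print(whitespace_tokenize(sentence))
# ['Word', 'tokenization', 'is', 'not', 'straightforward.']
print(treebank_like_tokenize(sentence))
# ['Word', 'tokenization', 'is', 'not', 'straightforward', '.']
```

The difference in the last token shows why the segmentation scheme matters: under pure whitespace splitting, "straightforward." (with the period) and "straightforward" would count as distinct word types.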
  
tokenization_-_segmentation_into_words.1431120859.txt.gz · Last modified: 2021/02/11 17:01 (external edit)