gum:tokenization_segmentation

Differences

This shows you the differences between two versions of the page.

gum:tokenization_segmentation [2021/02/11 16:44]
127.0.0.1 external edit
gum:tokenization_segmentation [2021/09/13 14:39]
amir GUM8 hyphenation
Line 7:
  
 ===Hyphenation===
-As a general rule, hyphenated words should be kept together. This is especially true of words that are determinative compounds, where the modifier cannot take a plural form and does not constitute an independent word. For example:
+As a general rule, hyphenated words should be split, since they can often be spelled apart. For example:
+
-  * 10-year plan (10-year is one token: if 10 were modifying year as an independent word, we would see 'years')
-  * one-liners (note the plural -s inflects the whole 'one-liner'; separating 'one' would imply there is a word 'liners', and a subtype of that is one-liners, but actually this is the plural of the noun 'one-liner')
+  * 10-year plan (10 - year plan)
+  * one-liners (one - liners)
  
 The same logic applies to participles and their argument, as well as 'self':
-  * energy-based (1 token)
+  * energy-based (3 tokens)
-  * self-proclaimed (1 token)
+  * self-proclaimed (3 tokens)
  
-Some exceptions to keeping hyphens together are spans of time, where the hyphen means from-to, and hyphens coordinating items on the same level (copulative, non-determinative compounds):
+Spans of time, where the hyphen means from-to, and hyphens coordinating items on the same level (copulative, non-determinative compounds):
   * 10:00-12:00 (three tokens)
   * Bill Clinton-Al Gore relationship (6 tokens, otherwise we have a token 'Clinton-Al')
   * China-Russia (a copulative compound where both members have the same status, not a subtype of Russia in a determinative reading)
 +
 +Exceptions which **should not** be tokenized apart include:
 +  * Morphological prefixes (re-, pre-, sub-, anti-)
 +  * Technical identifiers with hyphens (Bus A-1)
 +  * Syllabification/pronunciation guides ("it's pronounced soo-per")
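
The revised hyphenation rule and its exceptions can be illustrated with a minimal Python sketch. This is not the GUM project's tokenizer: the function name `split_hyphens`, the `PREFIXES` set (limited to the four prefixes listed above), and the regular expression used to spot technical identifiers such as "A-1" are all assumptions made for illustration. Pronunciation guides such as "soo-per" cannot be detected mechanically and are left to the annotator.

```python
import re

# Assumption: only the prefixes named in the guideline; a real list would be longer.
PREFIXES = {"re", "pre", "sub", "anti"}

def split_hyphens(word: str) -> list[str]:
    """Split one whitespace-delimited word at hyphens, keeping listed exceptions whole."""
    parts = word.split("-")
    if len(parts) == 1:
        return [word]  # no hyphen, nothing to do
    # Exception: morphological prefixes stay attached (re-, pre-, sub-, anti-).
    if parts[0].lower() in PREFIXES:
        return [word]
    # Exception (hypothetical heuristic): identifiers like "A-1",
    # i.e. one uppercase letter, a hyphen, then digits.
    if re.fullmatch(r"[A-Z]-\d+", word):
        return [word]
    tokens = []
    for i, part in enumerate(parts):
        if i > 0:
            tokens.append("-")  # the hyphen becomes its own token
        if part:
            tokens.append(part)
    return tokens

print(split_hyphens("energy-based"))   # ['energy', '-', 'based']
print(split_hyphens("10:00-12:00"))    # ['10:00', '-', '12:00']
print(split_hyphens("re-enter"))       # ['re-enter']
```

Applied word by word to "Bill Clinton-Al Gore relationship", this yields the six tokens the guideline calls for.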
  
 ===URLs and symbols from the Web===
-Keep URLs together, even if they contain discernible words:
+Keep URLs together, even if they contain discernible words or hyphens:
-  * www.campaignForFreedom.com (1 token)
+  * www.campaign-For-Freedom.com (1 token)
  
 ===Plurals with apostrophes===
Line 36: Line 42:
  
 ===Indicating original spacing around tokens spelled together===
+
 Items which originally were spelled together but which will be tokenized separately should be surrounded with the <w> tag to indicate that there was no space between them in the original text (unless original spacing is trivial to infer). For example:
  
   * We distinguish original “can not” from “cannot” by adding <w> around the latter (it’s two tokens either way)
   * We distinguish original "apples / oranges" from "apples/oranges" by adding <w> around the latter (it’s three tokens either way)
-  * contractions such as "didn't" do **not** get surrounded by <w>, as it is trivial to infer that the two tokens (i.e. "did" and "n't") were originally written without an intervening space.
+  * contractions such as "didn't", "I'm" do **not** get surrounded by <w>, as it is trivial to infer that the two tokens (i.e. "did" and "n't") were originally written without an intervening space.
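
The <w> convention described in the list above can be sketched in Python as follows. The one-token-per-line output layout, the function name `render_tokens`, and the `TRIVIAL` set of contraction pieces are assumptions for illustration, not the project's actual file format or tooling.

```python
# Assumption: contraction pieces whose original spacing is trivial to infer.
TRIVIAL = {"n't", "'m"}

def render_tokens(words: list[list[str]]) -> list[str]:
    """Emit one token per line; each inner list holds the tokens of one original word.

    A word that was spelled together but split into several tokens is wrapped
    in <w>...</w>, unless it is a contraction with trivially recoverable spacing.
    """
    lines = []
    for tokens in words:
        needs_w = len(tokens) > 1 and not any(t in TRIVIAL for t in tokens)
        if needs_w:
            lines.append("<w>")
        lines.extend(tokens)
        if needs_w:
            lines.append("</w>")
    return lines

# "cannot" was one word in the source, so its two tokens are wrapped.
print(render_tokens([["can", "not"]]))      # ['<w>', 'can', 'not', '</w>']
# "can not" was two words, so no tag is needed.
print(render_tokens([["can"], ["not"]]))    # ['can', 'not']
# Contractions are never wrapped.
print(render_tokens([["did", "n't"]]))      # ['did', "n't"]
```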
  
 The <w> tag is **not** used in cases of morphologically complex words which are analyzed as single tokens, such as:
gum/tokenization_segmentation.txt · Last modified: 2021/09/21 00:38 by nv214