gum:tokenization_segmentation

Differences

This shows you the differences between two versions of the page.

gum:tokenization_segmentation [2021/02/11 16:44]
127.0.0.1 external edit
gum:tokenization_segmentation [2021/09/21 00:38]
nv214
Line 7: Line 7:
  
 ===Hyphenation===
-As a general rule, hyphenated words should be kept together. This is especially true of words that are determinative compounds, where the modifier cannot take a plural form and does not constitute an independent word. For example:
-  * 10-year plan (10-year is one token: if 10 were modifying year as an independent word, we would see 'years')
-  * one-liners (note the plural -s inflects the whole 'one-liner'; separating 'one' would imply there is a word 'liners', and a subtype of that is one-liners, but actually this is the plural of the noun 'one-liner')
+As a general rule, hyphenated words should be split, since they can often be spelled apart. For example:
+
+  * 10-year plan (10 - year plan)
+  * one-liners (one - liners)
  
 The same logic applies to participles and their argument, as well as 'self':
-  * energy-based (1 token)
-  * self-proclaimed (1 token)
+  * energy-based (3 tokens)
+  * self-proclaimed (3 tokens)
  
-Some exceptions to keeping hyphens together are spans of time, where the hyphen means from-to, and hyphens coordinating items on the same level (copulative, non-determinative compounds):
+Spans of time, where the hyphen means from-to, and hyphens coordinating items on the same level (copulative, non-determinative compounds):
   * 10:00-12:00 (three tokens)
   * Bill Clinton-Al Gore relationship (6 tokens, otherwise we have a token 'Clinton-Al')
   * China-Russia (a copulative compound where both members have the same status, not a subtype of Russia in a determinative reading)
+
+Exceptions which **should not** be tokenized apart include:
+  * Morphological prefixes (re-, pre-, sub-, anti-)
+  * Technical identifiers with hyphens (Bus A-1)
+  * Syllabification/pronunciation guides ("it's pronounced soo-per")
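
The revised hyphenation rule (the `+` side of this hunk) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's actual tokenizer: the function name `split_hyphens`, the URL heuristic, and the pattern for technical identifiers are all hypothetical, and the prefix list contains only the four prefixes named above. Pronunciation guides such as "soo-per" are not detected, since recognizing them needs context beyond the single word.

```python
import re

# Assumption: only the four prefixes named in the guideline are listed.
PREFIXES = {"re", "pre", "sub", "anti"}

# Assumption: a crude URL heuristic; the guideline gives a single example.
URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)

# Assumption: a technical identifier is letters, a hyphen, then digits ("A-1").
IDENT_RE = re.compile(r"^[A-Za-z]+-\d+$")


def split_hyphens(word):
    """Split one whitespace-delimited word at hyphens, keeping each hyphen as a token."""
    if "-" not in word:
        return [word]
    # URLs and technical identifiers stay together.
    if URL_RE.match(word) or IDENT_RE.match(word):
        return [word]
    # Words starting with a morphological prefix stay together.
    first = word.split("-", 1)[0]
    if first.lower() in PREFIXES:
        return [word]
    # Default: the capturing group makes re.split keep the hyphens as tokens.
    return [part for part in re.split(r"(-)", word) if part]


def tokenize(text):
    """Whitespace-split the text, then apply the hyphen rule to each word."""
    tokens = []
    for word in text.split():
        tokens.extend(split_hyphens(word))
    return tokens
```

For instance, `tokenize("Bill Clinton-Al Gore relationship")` yields six tokens, matching the count given in the list above, while `split_hyphens("www.campaign-For-Freedom.com")` leaves the URL as one token.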
  
 ===URLs and symbols from the Web===
-Keep URLs together, even if they contain discernible words:
-  * www.campaignForFreedom.com (1 token)
+Keep URLs together, even if they contain discernible words or hyphens:
+  * www.campaign-For-Freedom.com (1 token)
  
 ===Plurals with apostrophes===
Line 36: Line 42:
  
 ===Indicating original spacing around tokens spelled together===
+
 Items which originally were spelled together but which will be tokenized separately should be surrounded with the <w> tag to indicate that there was no space between them in the original text (unless original spacing is trivial to infer). For example:
 
   * We distinguish original “can not” from “cannot” by adding <w> around the latter (it’s two tokens either way)
   * We distinguish original "apples / oranges" from "apples/oranges" by adding <w> around the latter (it’s three tokens either way)
-  * contractions such as "didn't" do **not** get surrounded by <w>, as it is trivial to infer that the two tokens (i.e. "did" and "n't") were originally written without an intervening space.
+  * contractions such as "didn't", "I'm" do **not** get surrounded by <w>, as it is trivial to infer that the two tokens (i.e. "did" and "n't") were originally written without an intervening space.
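
The `<w>` rule in the list above can be sketched in Python as follows. This is a hypothetical helper, not part of the project's tooling: the function name `wrap_w`, the one-token-per-line output with `<w>` and `</w>` on their own lines, and the `TRIVIAL_CLITICS` set are assumptions. The set covers only the two contractions the guideline mentions ("didn't" and "I'm"); a real list would presumably be longer.

```python
# Assumption: clitic tokens whose attachment is trivial to infer, so no <w>
# is needed. Only the two contractions named in the guideline are covered.
TRIVIAL_CLITICS = {"n't", "'m"}


def wrap_w(tokens, spelled_together):
    """Return the tokens one per list item, wrapped in <w>...</w> when needed.

    tokens: the tokens produced for one original orthographic unit.
    spelled_together: True if the original text had no spaces between them.
    """
    needs_tag = (
        spelled_together
        and len(tokens) > 1
        and not any(tok.lower() in TRIVIAL_CLITICS for tok in tokens)
    )
    if needs_tag:
        return ["<w>", *tokens, "</w>"]
    return list(tokens)
```

Under these assumptions, `wrap_w(["can", "not"], True)` wraps original "cannot" in `<w>`, while `wrap_w(["can", "not"], False)` leaves original "can not" unwrapped; both produce the same two tokens.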
  
 The <w> tag is **not** used in cases of morphologically complex words which are analyzed as single tokens, such as:
Line 92: Line 99:
  
 The 'multiple' category does not apply when there is a **main clause** of one type and **a subordinate clause** of a different type, e.g. "washing the dishes, John noticed the burglar" - in this case, we have a normal declarative clause that has a subordinate gerund. It is not a gerund type ("ger"), since there is really only one main matrix clause: the past tense one with "noticed".
+
+The 'multiple' category also does not apply when parenthetical sentences are present; parenthetical sentences may be 'below the level' of the main clause, and so only the type of the main clause applies. For example, the following is a 'sub' type, notwithstanding the parenthetical clause in italics:
+
+  *  I would say only that if some of my judgments were wrong--//and some were wrong//--they were made in what I believed at the time to be the best interest of the Nation
 
 ==Prioritization when multiple types apply==
gum/tokenization_segmentation.txt · Last modified: 2021/09/21 00:38 by nv214