Re temperature symbols: After talking to Amir, I split each temperature given in my document into three separate tokens (for example, 35°C → 35, °, and C on separate lines). Rationale: F and C are different compositional constructs and should be treated like currency symbols ($ is a different construct than £).
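
The split described above can be sketched in a few lines of Python. This is only an illustration, not the tool actually used for the annotation: the pattern and the function name split_temperature are hypothetical, and it assumes temperatures appear as a number followed directly by ° and a C or F.

```python
import re

# Hypothetical pattern: a number, the degree sign, then a C or F scale letter.
TEMPERATURE = re.compile(r"^(\d+(?:[.,]\d+)?)(°)([CF])$")


def split_temperature(token):
    """Return [number, '°', scale] for a temperature token, else [token]."""
    match = TEMPERATURE.match(token)
    if match is None:
        # Not a temperature: leave the token untouched.
        return [token]
    return list(match.groups())


# One token per line, as in the annotation files.
for piece in split_temperature("35°C"):
    print(piece)
```

Tokens that do not match the pattern are returned unchanged, so the function can be run over every token in a document.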

Re pluralized collective entities: I did not count the following sentence as an error, because I think this may be a trait of British English: “The secularist magazine MicroMega describe the court's judgement as historic.” The phenomenon in question is plural agreement with a collective noun (such as a magazine, a team, or a group).

Re elided words: I counted elided words (in Italian) as two tokens, but I did not make the apostrophe a separate token. One example is l' and Ici, and another is c' and è. I put these on separate lines because the first part is a contracted word, just as in English contractions: l' is the definite article and c' is ci, a clitic meaning “there.”
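
A minimal Python sketch of this rule, for illustration only: the pattern and the name split_elision are hypothetical, and the sketch assumes a single apostrophe (straight or curly) between two runs of letters. It keeps the apostrophe attached to the first token, as described above. It is meant for Italian elisions only and would split English contractions at the wrong place.

```python
import re

# Hypothetical pattern: letters plus an apostrophe, followed by more letters.
# [^\W\d_] matches any Unicode letter, so accented forms such as "è" are covered.
ELISION = re.compile(r"^([^\W\d_]+['’])([^\W\d_]+)$")


def split_elision(token):
    """Split an elided form into two tokens, keeping the apostrophe on the first."""
    match = ELISION.match(token)
    if match is None:
        # No elision found: leave the token untouched.
        return [token]
    return [match.group(1), match.group(2)]


for word in ["l'Ici", "c'è"]:
    print(split_elision(word))
```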

Re quotation marks: I marked quotation marks surrounding quoted written text using <q></q> rather than the mark for quoted speech (<quote></quote>).

Re sentence tagging: Under SOURCES, I parsed the reference title together with the date and source (magazine, etc.) as a single sentence. For example: <item><p><s><ref><q>“Sensitive government document found on rainy Ottawa street”</q></ref> — <ref></ref>,<date when="2008-08-15">August 15, 2008</date></s></p></item>
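
One way to check this convention is to count the <s> elements inside an item. The sketch below is an assumption-laden illustration rather than part of the annotation workflow: the names SAMPLE_ITEM and count_sentences are hypothetical, and the sample assumes that <s> encloses both <ref> elements and that the when attribute uses straight quotes, so that the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of a SOURCES entry; the second <ref> is left empty.
SAMPLE_ITEM = (
    '<item><p><s><ref><q>“Sensitive government document found on rainy '
    'Ottawa street”</q></ref> — <ref></ref>,'
    '<date when="2008-08-15">August 15, 2008</date></s></p></item>'
)


def count_sentences(item_xml):
    """Return the number of <s> elements inside one <item> fragment."""
    root = ET.fromstring(item_xml)
    return len(list(root.iter("s")))


# Under the convention above, a SOURCES entry should hold a single sentence.
print(count_sentences(SAMPLE_ITEM) == 1)
```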

gum/start.txt · Last modified: 2021/02/11 16:44 (external edit)