|caption||caption for images in the text|
|date@from||starting date for a range of dates|
|date@notAfter||latest possible date for an inexact date|
|date@notBefore||earliest possible date for an inexact date|
|date@rend||formatting of a date expression (e.g. italics, color)|
|date@to||end date for a range of dates|
|date@when||date in question, normalized to the format yyyy-mm-dd|
|figure||marks the position of a figure in the text|
|figure@rend||a description of the appearance of the figure|
|head||marks a heading|
|head@rend||a description of the appearance of the heading (e.g. bold)|
|hi@rend||a highlighted section with a description of its appearance (e.g. color)|
|incident@who||an extralinguistic incident (e.g. coughing), and the person responsible|
|item||item or bullet point in a list|
|l@n||a line in poetry with its number|
|lg@n||a line group with the group's number|
|lg@type||line group type (e.g. stanza)|
|list||list of bullet points|
|list@type||a list type (e.g. bulleted, ordered, etc.)|
|p@rend||a description of the appearance of the paragraph|
|q||quotation marks that do not mark a literal quotation (e.g. scare quotes); the tags are placed outside the quotation marks|
|ref||an external reference, usually a hyperlink|
|ref@target||the target of the reference (usually a URL, if not omitted)|
|s||a main sentence span|
|sic||a section containing an apparent language error, thus in the original|
|sp@who||a section uttered by a particular speaker with a reference to the speaker|
|time@from||starting time for a stretch of time|
|time@to||end time for a stretch of time|
|time@when||time in question, normalized to the format HH:mm, e.g. 16:30|
|w||tag to delimit a word, used when two tokens are spelled with no space, e.g. cannot|
Obvious typos and errors should be surrounded by the sic tag but not corrected. Later, during lemmatization, they will receive correctly spelled lemmas. Note that British spelling is not considered an error and should not be marked up in any special way.
I know <sic>th</sic> way. Your coat is a lovely colour.
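Because the original spelling is kept in the text, a later processing step may need to collect the flagged forms. The following is a minimal sketch in Python, assuming the sentence is available as a well-formed XML fragment; the function name sic_forms is hypothetical and not part of any corpus tooling:

```python
import xml.etree.ElementTree as ET

def sic_forms(fragment):
    """Return the text of every <sic> span in a well-formed XML fragment."""
    root = ET.fromstring(fragment)
    # iter() walks the whole tree; itertext() also picks up text in nested tags
    return ["".join(el.itertext()).strip() for el in root.iter("sic")]

print(sic_forms("<s>I know <sic>th</sic> way.</s>"))
```

British spellings such as colour are never wrapped in sic, so they do not appear in the result.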
Dates are marked up using the date element, usually with the @when attribute, in the yyyy-mm-dd format. It is possible to annotate dates fully, if they are known from context, even if the text mentions a partial date, e.g.:
<s><date when="2015-05-07" rend="bold">Thursday, May 7, 2015</date></s>
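The normalization of @when can be sketched in Python as follows. This is only an illustration: the helper date_tag is a hypothetical name, and it assumes the annotator has already resolved the date expression to a full calendar date.

```python
from datetime import date
from xml.sax.saxutils import escape, quoteattr

def date_tag(text, when, rend=None):
    """Wrap a date expression in <date>, with @when normalized to yyyy-mm-dd."""
    # date.isoformat() always yields the yyyy-mm-dd form
    attrs = f" when={quoteattr(when.isoformat())}"
    if rend:
        attrs += f" rend={quoteattr(rend)}"
    return f"<date{attrs}>{escape(text)}</date>"

print(date_tag("Thursday, May 7, 2015", date(2015, 5, 7), rend="bold"))
# A partial date in the text can still carry a full @when value
print(date_tag("May 7", date(2015, 5, 7)))
```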
Numbered lists and unnumbered bullet points are considered parts of structural markup, and both are a type of <list>.
The following example illustrates markup for a numbered list:
<list type="ordered"> <item n="1"> <!-- the number 1 is not a token, even though it appeared in the text --> <p>This is the first step</p> </item> </list>
Figures are surrounded by <figure> tags. Although the figures themselves are not preserved in the corpus, they can be described in the @rend attribute of the figure element. Descriptions are only made in the @rend attribute and are not added to the tokens of the text itself (for this reason, the alternative TEI method of using <figDesc> is NOT used).
<figure rend="Picture of Queen Elizabeth II"><caption>The Queen in Beijing last year</caption></figure>
<figure rend="list of suspects in the case and their mug shots">CEO - secretary - ambassador</figure>
<figure rend="picture of a valley"></figure>
Pop-up image descriptions or tooltips (in HTML, attributes such as 'alt' or 'title') are not considered running tokens of the text. They may optionally be included in @rend if desired.
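The distinction between descriptions in @rend and running text can be checked mechanically. The sketch below, in Python, assumes a well-formed XML fragment and uses a hypothetical helper named running_text; it collects only element content, so attribute values never end up among the tokens:

```python
import xml.etree.ElementTree as ET

def running_text(fragment):
    """Return the element content of a fragment, ignoring all attribute values."""
    root = ET.fromstring(fragment)
    # itertext() yields only text nodes; @rend descriptions are attributes
    return " ".join("".join(root.itertext()).split())

captioned = '<figure rend="Picture of Queen Elizabeth II"><caption>The Queen in Beijing last year</caption></figure>'
empty = '<figure rend="picture of a valley"></figure>'
print(running_text(captioned))
print(running_text(empty) == "")
```

For the captioned figure only the caption text is returned, and for the empty figure the result is an empty string.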
Typographical information for spans of text in hi@rend should be single words where possible, often derived from corresponding CSS vocabulary. For example, we use 'bold', 'italic', and 'large'. Multiple values are possible and should be separated by spaces:
<hi rend="bold italic large">The Big Picture</hi>
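Since multiple values are separated by spaces, they can be recovered by splitting the attribute on whitespace. A minimal Python sketch, with rend_values as a hypothetical helper name and a well-formed fragment assumed as input:

```python
import xml.etree.ElementTree as ET

def rend_values(fragment):
    """Return the space-separated values of @rend as a list."""
    el = ET.fromstring(fragment)
    # A missing @rend yields an empty list rather than an error
    return el.get("rend", "").split()

print(rend_values('<hi rend="bold italic large">The Big Picture</hi>'))
```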
Literal quotes are surrounded by the <quote> tags, regardless of whether or not quotation marks are used. But other uses of quotation marks are surrounded by <q>. Compare the following two uses:
Caesar said <quote>veni, vidi, vici</quote>. You could say that was his <q>" motto "</q>.
Some tokens that are spelled together cannot be trivially recognized as such after tokenization. Contractions such as n't or 'll are easy to recognize, but can + not may be spelled either can not or cannot. To mark the latter spelling, the tag <w> (for 'word') is placed around cannot:
I <w>cannot</w> do this (5 tokens)
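The relation between the token count and the <w> tag can be illustrated with a short Python sketch. It assumes, in line with the count of 5 tokens above, that cannot is tokenized as can + not; the function render_tokens and the span format (start and end token indices, end exclusive) are hypothetical choices made for this illustration only:

```python
def render_tokens(tokens, joined_spans):
    """Render tokens inline, wrapping tokens that were spelled together in <w>."""
    ends = {start: end for start, end in joined_spans}
    out = []
    i = 0
    while i < len(tokens):
        if i in ends:
            # Tokens inside a span are written without intervening spaces
            out.append("<w>" + "".join(tokens[i:ends[i]]) + "</w>")
            i = ends[i]
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

tokens = ["I", "can", "not", "do", "this"]
print(len(tokens))
print(render_tokens(tokens, [(1, 3)]))
```

Without a span, the same tokens are rendered as I can not do this, which is the spelling that needs no <w> tag.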
NOTE: As of GUM2, the following tags are no longer used
|div1||major, top level section|
|div1@n||section number for div1|
|div1@type||type of section (e.g. section, chapter, etc.)|
|div2||same as div1 for a second level nested section|
|div3||same as div1 for a third level nested section|