tag@attribute | meaning |
---|---|
add | TODO |
caption | caption for images in the text |
caption@rend | a description of the appearance of the caption |
cell | TODO |
cell@rend | a description of the appearance of the cell |
date | date expressions |
date@from | starting date for a range of dates |
date@notAfter | latest possible date for an inexact date |
date@notBefore | earliest possible date for an inexact date |
date@rend | formatting of a date expression (e.g. italics, color) |
date@to | end date for a range of dates |
date@when | date in question, normalized to the format yyyy-mm-dd |
figure | marks the position of a figure in the text |
figure@rend | a description of the appearance of the figure |
head | marks a heading |
head@rend | a description of the appearance of the heading (e.g. bold) |
hi@rend | a highlighted section with a description of its appearance (e.g. color) |
incident@who | an extralinguistic incident (e.g. coughing), and the person responsible |
item | item or bullet point in a list |
item@n | item number |
l@n | a line in poetry with its number |
lg@n | a line group with the group's number |
lg@type | line group type (e.g. stanza) |
list | list of bullet points |
list@type | a list type (e.g. bulleted, ordered, etc.) |
p | a paragraph |
p@rend | a description of the appearance of the paragraph |
q | quotation marks not marking a quotation (e.g. scare quotes; placed outside the quotes!) |
quote | a quotation |
ref | an external reference, usually a hyperlink |
ref@target | the target of the reference (usually a URL, if not omitted) |
s | a main sentence span |
sic | a section containing an apparent language error, thus in the original |
sp@who | a section uttered by a particular speaker with a reference to the speaker |
sp@whom | a section uttered with a particular speaker as an addressee |
time | time expressions |
time@from | starting time for a stretch of time |
time@to | end time for a stretch of time |
time@when | time in question, normalized to the format HH:mm, e.g. 16:30 |
w | tag to delimit a word, used when two tokens are spelled with no space, e.g. cannot |
Obvious typos and errors should be surrounded by the sic tag but not corrected. Later in lemmatization, they will receive correctly spelled lemmas. Note that British spelling is not considered an error and should not be marked up in any special way.
I know <sic>th</sic> way. Your coat is a lovely colour.
Dates are marked up using the date element, usually with the @when attribute, in the yyyy-mm-dd format. It is possible to annotate dates fully, if they are known from context, even if the text mentions a partial date, e.g.:
<s><date when="2015-05-07" rend="bold">Thursday, May 7, 2015</date></s>
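Normalized @when values can be checked mechanically. The following is a minimal sketch, assuming the input is well-formed XML and that every @when is meant to hold a full yyyy-mm-dd date; the function name `invalid_when_values` is hypothetical and not part of any GUM tooling.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime


def invalid_when_values(xml_text):
    """Return the date@when values that are not valid yyyy-mm-dd dates."""
    root = ET.fromstring(xml_text)
    bad = []
    # iter("date") visits every <date> element in document order
    for date in root.iter("date"):
        when = date.get("when")
        if when is None:
            continue  # @when is usual but not required
        # Assumption: only zero-padded, full dates are acceptable
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", when):
            bad.append(when)
            continue
        try:
            datetime.strptime(when, "%Y-%m-%d")  # rejects e.g. February 30
        except ValueError:
            bad.append(when)
    return bad
```

The regular expression enforces zero padding, and `strptime` then rejects dates that do not exist in the calendar.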
Numbered lists and unnumbered bullet points are considered parts of structural markup, and both are a type of <list>.
The following example illustrates markup for a numbered list:
<list type="ordered"> <item n="1"> <!-- the number 1 is not a token, even though it appeared in the text --> <p>This is the first step</p> </item> </list>
Figures are surrounded by <figure> tags. Although the figures themselves are not preserved in the corpus, they can be described in the attribute @rend of the figure element. Descriptions are only made in the @rend attribute and are not added to the tokens of the text itself (for this reason, the alternative TEI method of using <figureDesc> is NOT used).
<figure rend="Picture of Queen Elizabeth II"><caption>The Queen in Beijing last year</caption></figure>
<figure rend="list of suspects in the case and their mug shots">CEO - secretary - ambassador</figure>
<figure rend="picture of a valley"></figure>
Pop-up image descriptions or tooltips (in HTML, things like 'alt' or 'title') are not considered running tokens of the text. They may optionally be included in @rend if desired.
Typographical information for spans of text in hi@rend should be single words where possible, often derived from corresponding CSS vocabulary. For example, we use 'bold', 'italic', and 'large'. Multiple values are possible and should be separated by spaces:
<hi rend="bold italic large">The Big Picture</hi>
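Because multiple values are separated by spaces, a script reading the corpus can recover the individual values by splitting on whitespace. This is a minimal sketch, assuming well-formed XML input; the function name `rend_values` is hypothetical.

```python
import xml.etree.ElementTree as ET


def rend_values(xml_text):
    """Return the list of rend values for every <hi> element, in document order."""
    root = ET.fromstring(xml_text)
    # str.split() with no argument splits on any run of whitespace
    return [hi.get("rend", "").split() for hi in root.iter("hi")]
```

For the example above, this yields one list containing 'bold', 'italic' and 'large'.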
Literal quotes are surrounded by 'quote' tags, regardless of whether or not quotation marks are used. Other uses of quotation marks, such as scare quotes, are surrounded by 'q' tags, which are placed outside the quotation marks. Compare the following two uses:
Caesar said <quote>veni, vidi, vici</quote>. You could say that was his <q>" motto "</q>.
Footnotes with running text (as opposed to bibliographical references realized as numbers hyperlinked to the bibliography) are placed immediately after the paragraph that contains the numbered reference. The number is surrounded by ref tags, and the note is enclosed in a note element:
<p> Some long text.<ref>1</ref> Paragraph continues. At the end of this paragraph we'll insert the note. </p> <note place="foot" n="1">This is the footnote, which physically appeared at the bottom of the page, which was the middle of the next paragraph.</note> <p> Next paragraph. This one is split across pages, but the footnote does not appear in the middle of it, even though it was there graphically. </p>
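The pairing of footnote numbers and notes can be checked automatically. The sketch below assumes the paragraphs and notes are wrapped in a single root element (the `<text>` wrapper in the usage line is hypothetical) and that a footnote reference is a `<ref>` whose text is the number; `unmatched_footnotes` is an illustrative name, not existing tooling.

```python
import xml.etree.ElementTree as ET


def unmatched_footnotes(xml_text):
    """Return the n values of foot notes with no earlier <ref> carrying the same number."""
    root = ET.fromstring(xml_text)
    seen_refs = set()
    unmatched = []
    # root.iter() walks all elements in document order
    for elem in root.iter():
        if elem.tag == "ref" and elem.text:
            seen_refs.add(elem.text.strip())
        elif elem.tag == "note" and elem.get("place") == "foot":
            if elem.get("n") not in seen_refs:
                unmatched.append(elem.get("n"))
    return unmatched


doc = ('<text><p>Some long text.<ref>1</ref> Paragraph continues.</p>'
       '<note place="foot" n="1">This is the footnote.</note></text>')
print(unmatched_footnotes(doc))
```

An empty result means every footnote follows a reference with the same number.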
If a deleted comment on Reddit is not replied to within the context included in the document, it may be ignored. However, if the comment is part of a broken thread of responses, its existence can be encoded using an empty sp tag with the speaker set to DELETED, which can then be referred to in the reply:
<sp who="#DELETED"/> <sp who="#kim" whom="#DELETED"> I agree with you. </sp>
If two characters in a work of fiction say the same thing at the same time, tag both speakers in alphabetical order, separated by a comma (without a space), in the sp@who attribute:
<p> <sp who="#Fairy,#Narrator" whom="#Pete"> “No!” </sp> we both said at once. </p>
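The format of a multi-speaker sp@who value is easy to verify with a script. This is a minimal sketch, assuming that every speaker reference starts with '#' as in the examples and that alphabetical order ignores case (the guidelines do not specify case handling); `who_is_well_formed` is a hypothetical name.

```python
def who_is_well_formed(who):
    """Check that speakers are comma-separated, without spaces, in alphabetical order."""
    if " " in who:
        return False  # no space is allowed after the comma
    speakers = who.split(",")
    # Assumption: each reference is '#' followed by a non-empty name
    if any(not s.startswith("#") or len(s) == 1 for s in speakers):
        return False
    # Assumption: alphabetical order is case-insensitive
    return speakers == sorted(speakers, key=str.lower)
```

With this check, "#Fairy,#Narrator" passes, while "#Narrator,#Fairy" and "#Fairy, #Narrator" do not.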
If there are multiple possible addressees and it is not clear who/which subset is being addressed, all possible addressees are included in sp@whom (usually everyone but the speaker). Speech uttered to no specific addressee is left without the @whom attribute.
Some tokens that are spelled together cannot be trivially recognized as such after tokenization. Whereas n't or 'll are easy to recognize, can + not may be written either as can not or as cannot. To mark the latter case, we wrap cannot in the tag <w> (for 'word'):
I <w>cannot</w> do this (5 tokens)
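The token count in this example can be reproduced with a short script. The sketch below makes two simplifying assumptions that are not part of the guidelines: text outside `<w>` is tokenized on whitespace only (real tokenization also splits punctuation and clitics), and every `<w>` element contributes exactly two tokens. The name `count_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET


def count_tokens(elem):
    """Count tokens under elem, treating each <w> as two fused tokens."""
    if elem.tag == "w":
        return 2  # assumption: <w> always wraps exactly two tokens
    total = len((elem.text or "").split())
    for child in elem:
        total += count_tokens(child)
        # text following a child element belongs to the parent
        total += len((child.tail or "").split())
    return total


sentence = ET.fromstring("<s>I <w>cannot</w> do this</s>")
print(count_tokens(sentence))
```

Under these assumptions, the marked-up sentence and its spaced variant "I can not do this" receive the same count.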
If graphic section dividers separate different sections of the text but do not contain any words, tag them as in the following example:
<p> <s>* * *</s> </p>
NOTE: As of GUM2, the following tags are no longer used:
tag@attribute | meaning |
---|---|
div1 | major, top level section |
div1@n | section number for div1 |
div1@type | type of section (e.g. section, chapter, etc.) |
div2 | same as div1 for a second level nested section |
div2@n | |
div2@type | |
div3 | same as div1 for a third level nested section |
div3@n | |