TEI XML tags used in GUM

Formatting and content markup

tag@attribute meaning
caption caption for images in the text
date date expressions
date@from starting date for a range of dates
date@notAfter latest possible date for an inexact date
date@notBefore earliest possible date for an inexact date
date@rend formatting of a date expression (e.g. italics, color)
date@to end date for a range of dates
date@when date in question, normalized to the format yyyy-mm-dd
figure marks the position of a figure in the text
figure@rend a description of the appearance of the figure
head marks a heading
head@rend a description of the appearance of the heading (e.g. bold)
hi@rend a highlighted section with a description of its appearance (e.g. color)
incident@who an extralinguistic incident (e.g. coughing), and the person responsible
item item or bullet point in a list
item@n item number
l@n a line in poetry with its number
lg@n a line group with the group's number
lg@type line group type (e.g. stanza)
list list of bullet points
list@type a list type (e.g. bulleted, ordered, etc.)
p a paragraph
p@rend a description of the appearance of the paragraph
q quotation marks not marking a quotation (e.g. scare quotes; placed outside the quotes!)
quote a quotation
ref an external reference, usually a hyperlink
ref@target the target of the reference (usually a URL, if not omitted)
s a main sentence span
sic a section containing an apparent language error, thus in the original
sp@who a section uttered by a particular speaker with a reference to the speaker
time time expressions
time@from starting time for a stretch of time
time@to end time for a stretch of time
time@when time in question, normalized to the format HH:mm, e.g. 16:30
w tag to delimit a word, used when two tokens are spelled with no space, e.g. cannot

Guidelines

Errors and spelling variation

Obvious typos and errors should be surrounded by the sic tag but not corrected. Later, during lemmatization, they will receive correctly spelled lemmas. Note that British spelling is not considered an error and should not be marked up in any special way.

I know <sic>th</sic> way.
Your coat is a lovely colour.
Dates

Dates are marked up using the date element, usually with the @when attribute in yyyy-mm-dd format. A date can be annotated in full if it is known from context, even if the text mentions only a partial date, e.g.:

  • On <date when="2015-05-05">Tuesday</date>
  • Dates may have rendering, and free-standing dates are considered independent <s> units:
  <s><date when="2015-05-07" rend="bold">Thursday, May 7, 2015</date></s>
  • Partial dates are possible, such as years or months of years: <date when="2016">2016</date> <date when="2016-03">March 2016</date>
  • Circa dates. If the possible date range is clear from context, include the circa token in the tag and specify the range: <date notBefore="1898-01-01" notAfter="1898-12-31">c. 1898</date>. If the range is unclear from context, leave the circa token outside the tag: c. <date when="1200">1200</date>.
  • Names for ranges of dates are supplied using @from and @to, e.g. <date from="1990" to="1999">The 90s</date>.
  • If a date range is given explicitly, use two tags with @when: <date when="1990">1990</date> - <date when="2000">2000</date>
  • Years before 0000 (i.e. BC) receive a minus sign, but still have four digits: <date when="-0128">128 BC</date>
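The @when formats listed above lend themselves to a mechanical check. The following is a minimal sketch rather than part of any GUM tooling: the function name is_valid_when and the regular expression are assumptions derived only from the examples in this section (yyyy, yyyy-mm, yyyy-mm-dd, with an optional leading minus). It does not verify that a given day exists in a given month.

```python
import re

# Assumed pattern: optional minus, four-digit year, optional -mm, optional -dd.
WHEN_PATTERN = re.compile(r"-?\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def is_valid_when(value: str) -> bool:
    """Return True if value looks like yyyy, yyyy-mm or yyyy-mm-dd."""
    return WHEN_PATTERN.fullmatch(value) is not None


for sample in ["2015-05-05", "2016", "2016-03", "-0128", "5 May 2015", "128"]:
    print(sample, is_valid_when(sample))
```

The last two samples are rejected: the first is not normalized, and the second lacks the four-digit year that the guideline requires even for BC dates.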
Lists

Numbered lists and unnumbered bullet points are considered parts of structural markup, and both are a type of <list>.

  • The list element has a @type attribute to distinguish the two.
  • Each list item in an ordered list carries an attribute @n to designate the number. When @n is used, there is no need to make the number into a token as well: the number is considered a part of the styling, and not a token.
  • List items typically contain one or more paragraphs (<p> elements). Unlike headings, a list item with only one paragraph still uses a <p> element, both to distinguish its text flow (indentation, separation from the rest of the text) and for consistency with cases where a single list item has multiple paragraphs.

The following example illustrates markup for a numbered list:

  <list type="ordered">
  <item n="1">
  <!-- the number 1 is not a token, even though it appeared in the text -->
  <p>This is the first step</p>
  </item>
  </list>
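The list conventions above can be checked with a short script. This is a sketch under stated assumptions, not an official validator: the function name ordered_list_problems is hypothetical, and because the guidelines say only that items "typically" contain <p> elements, the paragraph check is best read as a warning rather than an error.

```python
import xml.etree.ElementTree as ET


def ordered_list_problems(xml_text: str) -> list[str]:
    """Report items in ordered lists that lack @n or a <p> child."""
    root = ET.fromstring(xml_text)
    problems = []
    # iter() also visits the root element itself, so a bare <list> works too.
    for lst in root.iter("list"):
        if lst.get("type") != "ordered":
            continue
        for position, item in enumerate(lst.findall("item"), start=1):
            if item.get("n") is None:
                problems.append(f"item {position}: missing n attribute")
            if item.find("p") is None:
                problems.append(f"item {position}: no <p> child")
    return problems


sample = """<list type="ordered">
<item n="1"><p>This is the first step</p></item>
<item><p>This item lacks a number</p></item>
<item n="3">This item lacks a paragraph</item>
</list>"""
print(ordered_list_problems(sample))
```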
Figures

Figures are surrounded by <figure> tags. Although the figures themselves are not preserved in the corpus, they can be described in the @rend attribute of the figure element. Descriptions appear only in the @rend attribute and are not added to the tokens of the text itself (for this reason, the alternative TEI method of using <figDesc> is NOT used).

  • Figures that have a caption surround the caption element, and the caption itself contains tokens that are annotated as usual (since they actually appear in, and are part of the text):
  <figure rend="Picture of Queen Elizabeth II"><caption>The Queen in Beijing last year</caption></figure>
  • Figures may contain tokenizable text, e.g. if the figure is meant to be read.
  <figure rend="list of suspects in the case and their mug shots">CEO - secretary - ambassador</figure>
  • Figures without captions or other tokenizable text are left empty, but enclosed by figure tags:
  <figure rend="picture of a valley"></figure>

Pop up image descriptions or tooltips (in HTML, things like 'alt' or 'title') are not considered running tokens of the text. They may optionally be included in @rend if desired.
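Because descriptions live only in @rend, the running text of a figure can be recovered by reading element content and ignoring attributes. The sketch below illustrates this with the three examples above; the function name figure_running_text is hypothetical, and the whitespace normalization is an assumption rather than the corpus tokenizer.

```python
import xml.etree.ElementTree as ET


def figure_running_text(figure_xml: str) -> str:
    """Return the text content of a figure, ignoring the rend attribute."""
    figure = ET.fromstring(figure_xml)
    # itertext() walks text nodes only, so attribute values never appear.
    return " ".join("".join(figure.itertext()).split())


captioned = ('<figure rend="Picture of Queen Elizabeth II">'
             '<caption>The Queen in Beijing last year</caption></figure>')
readable = ('<figure rend="list of suspects in the case and their mug shots">'
            'CEO - secretary - ambassador</figure>')
empty = '<figure rend="picture of a valley"></figure>'

for example in (captioned, readable, empty):
    print(repr(figure_running_text(example)))
```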

Values for rend

Typographical information for spans of text in hi@rend should be single words where possible, often derived from corresponding CSS vocabulary. For example, we use 'bold', 'italic', and 'large'. Multiple values are possible and should be separated by spaces:

  <hi rend="bold italic large">The Big Picture</hi>
Quotation marks

Literal quotes are surrounded by <quote> tags, regardless of whether quotation marks are used. Other uses of quotation marks are surrounded by <q>, with the tags placed outside the quotation marks. Compare the following two uses:

  Caesar said <quote>veni, vidi, vici</quote>. You could say that was his <q>" motto "</q>.
Footnotes

Footnotes with running text (as opposed to bibliographical references realized as numbers hyperlinked to the bibliography) are placed immediately after the paragraph that contains the numbered reference. The number is surrounded by ref tags, and the note is enclosed in a note element:

<p>
Some long text.<ref>1</ref> Paragraph continues. At the end of this paragraph we'll insert the note.
</p>
<note place="foot" n="1">This is the footnote, which physically appeared at the bottom of the page, which was the middle of the next paragraph.</note>
<p>
Next paragraph. This one is split across pages, but the footnote does not appear in the middle of it, even though it was there graphically.
</p>
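The placement rule for footnotes can be sketched as a consistency check: every numeric <ref> in a paragraph should be answered by a <note> with the same @n before the next paragraph begins. The function name missing_footnotes and the wrapping <text> element are assumptions made so that the fragment parses as a single document. The check is also slightly looser than the guideline, since it accepts a note anywhere between two paragraphs.

```python
import xml.etree.ElementTree as ET


def missing_footnotes(xml_text: str) -> list[str]:
    """Return footnote numbers whose <note> does not follow their paragraph."""
    root = ET.fromstring(xml_text)
    missing = []
    pending = []  # numbers referenced in the most recent <p>
    for child in root:
        if child.tag == "p":
            # Notes still pending when a new paragraph starts never appeared.
            missing.extend(pending)
            pending = [ref.text.strip() for ref in child.iter("ref")
                       if ref.text and ref.text.strip().isdigit()]
        elif child.tag == "note" and child.get("n") in pending:
            pending.remove(child.get("n"))
    missing.extend(pending)
    return missing


complete = """<text>
<p>Some long text.<ref>1</ref> Paragraph continues.</p>
<note place="foot" n="1">This is the footnote.</note>
<p>Next paragraph.</p>
</text>"""
no_note = """<text>
<p>Some long text.<ref>1</ref> Paragraph continues.</p>
<p>Next paragraph.</p>
</text>"""
print(missing_footnotes(complete))
print(missing_footnotes(no_note))
```

References whose content is not a number, such as hyperlinks, are ignored by this sketch.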
Reference to deleted speakers

If a deleted comment on Reddit is not replied to within the context included in the document, it may be ignored. However, if the comment is part of a broken thread of responses, its existence can be encoded using an empty sp tag with the speaker set to DELETED, which the reply can then refer to:

<sp who="#DELETED"/>
<sp who="#kim" whom="#DELETED">
I agree with you.
</sp>
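One way to confirm that every reply points at a speaker who actually occurs in the document, including the DELETED placeholder, is sketched below. The function name dangling_addressees and the wrapping <text> element are assumptions for illustration, not part of the GUM format.

```python
import xml.etree.ElementTree as ET


def dangling_addressees(xml_text: str) -> list[str]:
    """Return sp@whom values that never occur as a speaker in sp@who."""
    root = ET.fromstring(xml_text)
    speakers = set()
    for sp in root.iter("sp"):
        # who may list several speakers separated by commas.
        speakers.update(sp.get("who", "").split(","))
    dangling = []
    for sp in root.iter("sp"):
        whom = sp.get("whom")
        if whom is not None and whom not in speakers:
            dangling.append(whom)
    return dangling


with_placeholder = """<text>
<sp who="#DELETED"/>
<sp who="#kim" whom="#DELETED">
I agree with you.
</sp>
</text>"""
without_placeholder = """<text>
<sp who="#kim" whom="#DELETED">
I agree with you.
</sp>
</text>"""
print(dangling_addressees(with_placeholder))
print(dangling_addressees(without_placeholder))
```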
Reference to multiple speakers

If two characters in a work of fiction say the same thing at the same time, tag both speakers in alphabetical order, separated by a comma (without a space), in the sp@who attribute:

<p>
<sp who="#Fairy,#Narrator">
“No!”
</sp> 
we both said at once.
</p>
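The formatting of a multi-speaker sp@who value (comma-separated, no spaces, alphabetical order) can be verified with a few lines of code. This is a sketch only: the function name who_is_well_formed is hypothetical, the leading # is assumed from the examples on this page, and alphabetical order is taken to be case-insensitive, which the guidelines do not specify.

```python
def who_is_well_formed(who: str) -> bool:
    """Check comma separation without spaces and alphabetical speaker order."""
    speakers = who.split(",")
    # A space after the comma leaves a leading blank, so startswith("#") fails.
    each_ok = all(s.startswith("#") and len(s) > 1 and s == s.strip()
                  for s in speakers)
    in_order = speakers == sorted(speakers, key=str.lower)
    return each_ok and in_order


for value in ["#Fairy,#Narrator", "#Narrator,#Fairy", "#Fairy, #Narrator", "#kim"]:
    print(value, who_is_well_formed(value))
```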
Tokens with no intervening spaces

Some tokens that are spelled together cannot be trivially recognized as such after tokenization. Whereas n't or 'll are easy, can + not may be written as can not or cannot. To mark the latter case, wrap cannot in the tag <w> (for 'word'):

  I <w>cannot</w> do this (5 tokens)

Structural markup

NOTE: As of GUM2, the following tags are no longer used:

tag@attribute meaning
div1 major, top level section
div1@n section number for div1
div1@type type of section (e.g. section, chapter, etc.)
div2 same as div1 for a second level nested section
div2@n
div2@type
div3 same as div1 for a third level nested section
div3@n
gum/tei_markup_in_gum.txt · Last modified: 2018/09/17 14:39 by rem132