TEI XML tags used in GUM

Formatting and content markup

tag@attribute   meaning
caption         caption for images in the text
date            date expressions
date@from       starting date for a range of dates
date@notAfter   latest possible date for an inexact date
date@notBefore  earliest possible date for an inexact date
date@rend       formatting of a date expression (e.g. italics, color)
date@to         end date for a range of dates
date@when       date in question, normalized to the format yyyy-mm-dd
figure          marks the position of a figure in the text
figure@rend     a description of the appearance of the figure
head            marks a heading
head@rend       a description of the appearance of the heading (e.g. bold)
hi@rend         a highlighted section with a description of its appearance (e.g. color)
incident@who    an extralinguistic incident (e.g. coughing), and the person responsible
item            item or bullet point in a list
item@n          item number
l@n             a line in poetry with its number
lg@n            a line group with the group's number
lg@type         line group type (e.g. stanza)
list            list of bullet points
list@type       a list type (e.g. bulleted, ordered, etc.)
p               a paragraph
p@rend          a description of the appearance of the paragraph
quote           a quotation
ref             an external reference, usually a hyperlink
ref@target      the target of the reference (usually a URL, if not omitted)
s               a main sentence span
sic             a section containing an apparent language error, thus in the original
sp@who          a section uttered by a particular speaker with a reference to the speaker
time            time expressions
time@from       starting time for a stretch of time
time@to         end time for a stretch of time
time@when       time in question, normalized to the format HH:mm, e.g. 16:30
w               tag to delimit a word, used when two tokens are spelled with no space, e.g. cannot


Errors and spelling variation

Obvious typos and errors should be surrounded by the sic tag but not corrected. Later, during lemmatization, they receive correctly spelled lemmas. Note that British spelling is not considered an error and should not be marked up in any special way.

I know <sic>th</sic> way.
Your coat is a lovely colour.

Dates are marked up using the date element, usually with the @when attribute, in the yyyy-mm-dd format. It is possible to annotate dates fully, if they are known from context, even if the text mentions a partial date, e.g.:

  • On <date when="2015-05-05">Tuesday</date>
  • Dates may have rendering, and free-standing dates are considered independent <s> units:
  <s><date when="2015-05-07" rend="bold">Thursday, May 7, 2015</date></s>
  • Partial dates are possible, such as years: <date when="2016">2016</date>
  • Names for ranges of dates are supplied using @from and @to, e.g. <date from="1990" to="1999">The 90s</date>
  • Years before 0000 (i.e. BC) receive a minus, but still have four digits: <date when="-0128">128 BC</date>
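
The @when conventions above can be checked mechanically. The following is a minimal Python sketch, not part of the GUM tooling: the function name is_valid_when is hypothetical, and accepting a year-month form (yyyy-mm) is an assumption, since the guidelines only mention years as an example of a partial date. It checks the shape of the value and the month and day ranges, not whether the date exists in the calendar.

```python
import re

# Accepted shapes for date@when, following the guideline text:
# an optional minus for BC years, a four-digit year, and optionally
# -mm and -dd. The year-month form (yyyy-mm) is an assumption.
WHEN_PATTERN = re.compile(r"-?\d{4}(-\d{2}(-\d{2})?)?")


def is_valid_when(value: str) -> bool:
    """Return True if value looks like a normalized date@when value."""
    if WHEN_PATTERN.fullmatch(value) is None:
        return False
    # Range-check month and day when present (no calendar check).
    parts = value.lstrip("-").split("-")
    if len(parts) > 1 and not 1 <= int(parts[1]) <= 12:
        return False
    if len(parts) > 2 and not 1 <= int(parts[2]) <= 31:
        return False
    return True


if __name__ == "__main__":
    for candidate in ["2015-05-05", "2016", "-0128", "128", "2015-5-5"]:
        print(candidate, is_valid_when(candidate))
```

A value such as 128 is rejected because the year must still have four digits, which matches the BC example above.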

Numbered lists and unnumbered bullet points are considered parts of structural markup, and both are a type of <list>.

  • The list element has a @type attribute to distinguish the two.
  • Each list item in an ordered list carries an attribute @n to designate the number. When @n is used, there is no need to make the number into a token as well: the number is considered a part of the styling, and not a token.
  • List items typically contain one or more paragraphs (<p> elements). Unlike headings, even if the list contains only one paragraph, a <p> element is used to distinguish its text flow (indentation, separation from rest of text), and for consistency in cases where a single list item has multiple paragraphs.

The following example illustrates markup for a numbered list:

  <list type="ordered">
  <item n="1">
  <!-- the number 1 is not a token, even though it appeared in the text -->
  <p>This is the first step</p>
  </item>
  </list>
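
The list conventions described above (a @type on the list, an @n on each item of an ordered list, and text wrapped in <p>) lend themselves to a simple consistency check. The sketch below uses Python's standard xml.etree.ElementTree. The function name check_list and the wording of the messages are hypothetical, and because the guidelines say items "typically" contain paragraphs, the missing <p> message is best read as a warning rather than an error.

```python
import xml.etree.ElementTree as ET


def check_list(xml_text: str) -> list:
    """Return a list of problems found in a single <list> element."""
    root = ET.fromstring(xml_text)
    problems = []
    if root.get("type") is None:
        problems.append("list has no type attribute")
    for position, item in enumerate(root.findall("item"), start=1):
        # Ordered items carry @n; the number itself is not a token.
        if root.get("type") == "ordered" and item.get("n") is None:
            problems.append(f"item {position} has no n attribute")
        # Item text is expected to sit inside at least one <p>.
        if item.find("p") is None:
            problems.append(f"item {position} has no p element")
    return problems


if __name__ == "__main__":
    example = (
        '<list type="ordered">'
        '<item n="1"><p>This is the first step</p></item>'
        "</list>"
    )
    print(check_list(example))
```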

Figures are surrounded by <figure> tags. Although the figures themselves are not preserved in the corpus, they can be described in the attribute @rend of the figure element. Descriptions are only made in the @rend attribute and are not added to the tokens of the text itself (for this reason, the alternative TEI method of using <figDesc> is not used).

  • Figures that have a caption surround the caption element, and the caption itself contains tokens that are annotated as usual (since they actually appear in, and are part of the text):
  <figure rend="Picture of Queen Elizabeth II"><caption>The Queen in Beijing last year</caption></figure>
  • Figures may contain tokenizable text, e.g. if the figure is meant to be read:
  <figure rend="list of suspects in the case and their mug shots">CEO - secretary - ambassador</figure>
  • Figures without captions or other tokenizable text are left empty, but enclosed by figure tags:
  <figure rend="picture of a valley"></figure>

Pop-up image descriptions or tooltips (in HTML, things like 'alt' or 'title') are not considered running tokens of the text. They may optionally be included in @rend if desired.
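
One practical consequence of keeping descriptions in @rend is that they never end up among the tokens: only element content does. The following Python sketch illustrates this with the standard xml.etree.ElementTree. The function name figure_text is hypothetical, and the snippet only extracts the running text, it does not tokenize it.

```python
import xml.etree.ElementTree as ET


def figure_text(xml_text: str) -> str:
    """Return the running text inside a <figure>, ignoring @rend."""
    figure = ET.fromstring(xml_text)
    # itertext() walks element content only; attribute values such as
    # the rend description are never included.
    return " ".join("".join(figure.itertext()).split())


if __name__ == "__main__":
    with_caption = (
        '<figure rend="Picture of Queen Elizabeth II">'
        "<caption>The Queen in Beijing last year</caption></figure>"
    )
    empty = '<figure rend="picture of a valley"></figure>'
    print(repr(figure_text(with_caption)))
    print(repr(figure_text(empty)))
```

The captioned figure yields its caption text, while the figure without caption yields an empty string, so it contributes no tokens.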

Values for rend

Typographical information for spans of text in hi@rend should be single words where possible, often derived from corresponding CSS vocabulary. For example, we use 'bold', 'italic', and 'large'. Multiple values are possible and should be separated by spaces:

  <hi rend="bold italic large">The Big Picture</hi>

Tokens with no intervening spaces

Some tokens that are spelled together cannot be trivially recognized as such after tokenization. Whereas n't or 'll are easy to recognize, the token sequence can + not may have been spelled either can not or cannot. To mark the latter case, we add the tag <w> (for 'word') around cannot:

  I <w>cannot</w> do this (5 tokens)
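
To show why the <w> tag matters, the sketch below rebuilds the original spelling from tokenized input. It assumes a one-token-per-line layout in which tags stand on their own lines; this layout is an assumption made for the illustration and is not specified on this page. The function name surface_text is hypothetical.

```python
def surface_text(lines: list) -> str:
    """Rebuild the spelled text from one-token-per-line input."""
    words = []          # finished orthographic words
    inside_w = False
    for line in lines:
        line = line.strip()
        if line == "<w>":
            inside_w = True
            words.append("")      # start an empty word to fill
        elif line == "</w>":
            inside_w = False
        elif line.startswith("<"):
            continue              # other tags carry no text
        elif line and inside_w:
            words[-1] += line     # glue tokens with no space
        elif line:
            words.append(line)
    return " ".join(words)


if __name__ == "__main__":
    tokens = ["<s>", "I", "<w>", "can", "not", "</w>", "do", "this", "</s>"]
    print(surface_text(tokens))
    print(sum(1 for t in tokens if not t.startswith("<")), "tokens")
```

Without the <w> wrapper, the same five tokens would be read as the spelling can not.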

Structural markup

tag@attribute   meaning
div1            major, top level section
div1@n          section number for div1
div1@type       type of section (e.g. section, chapter, etc.)
div2            same as div1 for a second level nested section
div3            same as div1 for a third level nested section
gum/tei_markup_in_gum.txt · Last modified: 2016/09/28 16:56 by amir