Entity and Information Status Annotation

Delimiting markables

Markables for annotation include any referential NPs (cf. Dipper et al. 2007), including pronouns, e.g. [the farm], [it]. Non-referring NPs include idiomatic NPs such as 'on the other hand'; in this case, 'the other hand' is not annotated.

Copula Predicates (A is B)

  • Unlike OntoNotes (Weischedel et al. 2012), copula predicates are markables, whether or not they are coreferred to separately from the subject: [John] is [a teacher]. Negative predication precludes coreference, but markables should still be annotated for negated copula predicates:
    • [John] is [a teacher], and [a teacher] must always…
    • [John] is not [a teacher] (no coreference)
  • The same reasoning is applied to cases such as [A] is considered [a kind of B] - both markables are created and they co-refer.
  • Similarly, “[A] is like [B]” or “[A] is similar to [B]” (two markables, no coreference)
  • Copula coreference is not annotated for modal predication, since the copular identity is not complete and this can lead to contradictions:
    • [This] can often be [an advantage] but can sometimes also be [a disadvantage] (no coreference)
  • Cases with “[A] as [B]” may be annotated as coreferent if they imply identity of referents, similar to copula predication. For example:
    • YES coreferent: [John] worked as [a baker] (the baker is then literally the same as John, compare [John] is [a baker])
    • NOT coreferent: [John] is the same as [my brother] (John is not my brother; this is another form of the 'similar to' case above)

What to include in the markable

  • In most cases, the span of the markable will be the entire NP including modifiers, such as prepositional phrases that belong to the NP, possessors or genitives. Thus in “[the boy with the blue coat] saw [her]” the spans of the two markables are maximal, as delineated by the brackets.
  • Exceptions to the rule of including all modifiers include clause expansions. These clauses are not included inside the markable:
    • Relative clauses (acl:relcl): I saw [the boy] who came to class
    • Participial post-modifiers (acl): [The city of York], founded by the Romans (not [The city of York, founded by the Romans])
  • In some cases, adverbs such as “here” or “there” when referring to a place that has already been mentioned (or similarly “then” for a time) will also need to be made into markables, e.g. “…went to [London]. [There] …”.
  • Occasionally, an entire sentence or clause will be referred to by a second referring expression. In this case the entire sentence being referred to will be made into a markable, usually an event: “[the rain flooded the village.] ← [it] was terrible”.
  • When such reference does not occur, sentences are not made into markables. Note that if the entire sentence is inside the markable, then sentence final punctuation is also included in the markable (here the final period).
  • It is possible that a sentence antecedent of a pronoun has been mentioned even before its most recent occurrence. In this case, previous occurrences will also be marked. For example, the clause “push the button for every floor” is considered given because the VP was mentioned in a previous title. In this case, all three markables should be linked together:
    1. [Push all the buttons]new
    2. When you get into the elevator, [push the button for every floor]giv
    3. [This]giv makes everyone's ride on the elevator longer, if just for a few seconds
  • Titles and epithets: Words like Mr., but also roles like President, are part of a markable and do not constitute an apposition. As the third example below shows, plural titles can refer to an entire coordinate markable, each constituent of which is a markable without the title (since a plural title belongs only to the coordination).
    • [Mr. Smith]
    • [President Carter]
    • [Rappers [MC Hammer]person|new and [2Pac]person|new]person|new

Coordination (A and B)

  • Coordinate phrases generally receive multiple markables, e.g. “[restaurants] and [hotels]”. If both are referred to together, an additional markable is added: … [ [restaurants] and [hotels] ]. [They] are always expensive.
  • In cases where two (or more) nouns are not full NPs, i.e. when they share an article, we only assign one markable by default:
    • She is [my wife and best friend] (one markable, since both are determined by 'my')
  • The two submarkables are additionally annotated only in the rare event that both nouns within such a markable are separately referred to later, or if the two submarkables would have different entity types.
  • If there is an aggregate mention of a mixed-type markable, the entity type is 'abstract', e.g.:
    • We saw [ [a sheep]animal and [a bottle]object]abstract. [They]abstract were both white.
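The typing rule for aggregate mentions can be sketched in a few lines of Python. This is only an illustration: the function name `aggregate_entity_type` is hypothetical, and keeping the shared type when all conjuncts agree is an assumption based on examples such as the 'Rappers' coordination above, not an explicitly stated rule.

```python
from typing import Iterable


def aggregate_entity_type(conjunct_types: Iterable[str]) -> str:
    """Entity type for a markable spanning a whole coordination (hypothetical helper)."""
    distinct = set(conjunct_types)
    if not distinct:
        raise ValueError("a coordination needs at least one conjunct")
    # Assumption: uniform conjuncts keep their shared type.
    # Per the guideline, mixed conjuncts yield 'abstract'.
    return distinct.pop() if len(distinct) == 1 else "abstract"


print(aggregate_entity_type(["animal", "object"]))  # mixed: a sheep and a bottle
print(aggregate_entity_type(["person", "person"]))  # uniform: two rappers
```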

Other Specific Cases

State Names

  • States are generally available as individual referents, but City+State also form a markable. Note that the city markable is therefore longer. If Ohio is referred to later on in the text, it is coreferent with the smaller markable.
    • [Cleveland, [Ohio]]
  • Abbreviations such as [OH] for Ohio may also be accepted as markables.
  • Entity names within complex tokens are not annotated, e.g. if we have a token 'church-related', we cannot annotate just the subpart 'church' as a markable, nor will we annotate the entire 'church-related' as a markable. These cases are left out of the annotation.

Naming Constructions

In cases where a name for something is given, that name is not coreferent with the thing being named, but may form an abstract entity. The rationale for this is that subsequent reference to the name would be possible, and would generally correspond to an abstract name notion.

  • [John]person called [himself]person “ [The Terminator]abstract ” . [This name]abstract … (note “this name” corefers to “the Terminator”, but not “John”, and is abstract)
  • Subsequent references to the name will only coreference the name markable, and not the thing being named. The name bridges to the thing being named.


  • Full dates receive 3 markables: one for the month, one for the year, and one for the whole date (since the day is the head): [ [March] 12 [2012] ] (all three markables are 'time', and all would be accessible on first mention)
  • Months whose specific year can be resolved by referring back to a year mentioned earlier in the text are seen as bridging: 2007 …. ←bridge- October
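As a rough sketch of the three-markable rule for full dates, the following Python function returns token spans for a tokenized 'Month Day Year' date. The function name, the token-offset representation, and the restriction to exactly three tokens are assumptions made for illustration only.

```python
from typing import List, Tuple

MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def full_date_markables(tokens: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, entity_type) token spans, end exclusive,
    for a three-token 'Month Day Year' date (hypothetical helper)."""
    if len(tokens) != 3:
        raise ValueError("expected exactly three tokens: month, day, year")
    month, day, year = tokens
    if month not in MONTHS or not day.isdigit() or not (year.isdigit() and len(year) == 4):
        raise ValueError(f"not a 'Month Day Year' date: {tokens!r}")
    return [
        (0, 3, "time"),  # the whole date, headed by the day
        (0, 1, "time"),  # the month
        (2, 3, "time"),  # the year
    ]


date = ["March", "12", "2012"]
for start, end, entity_type in full_date_markables(date):
    print(" ".join(date[start:end]), entity_type)
```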

Citations and references in Wikipedia

Authorial citations with author names are taken to be references to the author(s) and year (similar to “Smith said in 2009”):

  • [Smith]person [2009]time

Numerical links are taken to be mentions of the work, and therefore abstract. Coreference is also marked for each matching citation number:

  • … has been shown in the past ( [17]abstract ) Other studies disagreed … (see ←coref– [17]abstract )

Specific non-referential NPs

The following examples are not considered referential NPs:

  • “Every time (that)…” (i.e. 'every time' meaning 'when' or 'always')
  • “all the time”
  • “on a daily basis”
  • “(in) line (with)” (not an instance of a 'line')
  • “take time”

Information status

  • Information status has three possible values:
    • new - not mentioned before, first mention
    • giv - mentioned before, must be linked to previous mention
    • acc - accessible - not mentioned before, but immediately available, requiring no introduction. This includes:
      • generics - [the sun]acc, [the world]acc
      • indexical expressions - [I]acc, [you]acc, [here]acc, [this]acc (when pointing to something)
      • absolute time expressions - [2016]acc, [October]acc, [this Friday]acc
      • major countries taken as requiring no introduction - in [the US]acc today …
      • bridging - The company … [The CEO]acc (in this case, a bridging link must be made, see Coreference below)
  • Do not overuse the accessible generic category: not every definite NP is accessible if it is the first mention in the chain. Some examples that are not considered accessible:
    • [an Officer in [the [United States]acc Air Force]new]new – The US is considered accessible by the major country guideline, and the officer is new (and indefinite). Although the speaker may assume that we know what 'US Air Force' is, it needed to be introduced into the discourse model, much like the introduction of a proper name that we know.
    • In the same way, the first mention of [Barack Obama]new is tagged as new, even though his identity is available to us. The idea is that newness refers to the introduction into the discourse model.
  • Personal pronouns that are inferable in the situation (I/me, you etc.) are accessible the first time they are mentioned. They are subsequently tagged as ‘giv’, since they have already been referred to explicitly.
  • Information status for cataphors: see Coreference below.
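The decision order among the three values can be summarized in a small Python sketch. The function and its boolean parameters are hypothetical: they assume an annotator has already decided whether a mention has a previous mention, is bridged, or falls under one of the inherently accessible classes (generics, indexicals, absolute times, major countries). Cataphors are not handled here, and the 'aggregate' flag reflects the guideline that aggregate bridging anaphors count as given.

```python
def information_status(previously_mentioned: bool,
                       bridged: bool = False,
                       aggregate: bool = False,
                       inherently_accessible: bool = False) -> str:
    """Pick 'new', 'giv' or 'acc' for a non-cataphoric mention (illustrative only)."""
    if previously_mentioned:
        return "giv"  # must be linked to the previous mention
    if bridged:
        # Inferrable parts are accessible; aggregate referents are given.
        return "giv" if aggregate else "acc"
    if inherently_accessible:
        return "acc"  # e.g. [the sun], first [I], [2016], [the US]
    return "new"      # first introduction into the discourse model


print(information_status(False))                              # first mention of [Barack Obama]
print(information_status(False, inherently_accessible=True))  # first mention of [I]
print(information_status(False, bridged=True))                # The company ... [The CEO]
print(information_status(True))                               # a later [me]
```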

Entity Type

There are 11 entity types:

  • person - any person, including fictitious figures, groups of people, and semi-human entities (Pinocchio)
  • place - a country (Iceland), region (Sahara), or other place being referred to as a location (the factory - when used as a place, not to refer to the physical building)
  • organization - a company, government, sports team and others
  • object - a concrete object, possibly also in a metaphorical sense (e.g. a computer file)
  • event - includes reference to nouns ('War', 'the performance') and clauses that are referred back to ('that John came')
  • time - dates, times of day, days, years…
  • substance - water, mercury, gas, poison … includes context-dependent substances, such as Skittles or baking chocolate
  • animal - any animal, potentially including bacteria, aliens and others construed as animals
  • plant
  • abstract - abstract notions (luck), emotions (excitement) or intangible properties (predisposition)
  • quantity - amounts without designation of a unit: approval went down [3%], the capacity of this jug is [3 liters]

Special notes on entity types

With the exception of bridge relations, two coreferring markables must have the same entity type. This can be tricky when two markables seem to fall into two different categories. For instance, the owners of Steve's Bar may describe it as both their [business]organization and [a dive bar that locals frequent]place. But Steve's Bar is ultimately an organization, and so all markables will have organization entities.
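A consistency check for this constraint might look like the Python sketch below. The set `ENTITY_TYPES` mirrors the eleven labels listed above; the function name `check_link` and the representation of a link as a relation plus two type labels are assumptions for illustration, not part of any existing annotation tool.

```python
from typing import List

ENTITY_TYPES = frozenset({
    "person", "place", "organization", "object", "event", "time",
    "substance", "animal", "plant", "abstract", "quantity",
})


def check_link(relation: str, anaphor_type: str, antecedent_type: str) -> List[str]:
    """Return a list of problems found for one link (empty if none)."""
    problems = []
    for label in (anaphor_type, antecedent_type):
        if label not in ENTITY_TYPES:
            problems.append(f"unknown entity type: {label}")
    # Only bridge relations may connect markables of different types.
    if relation != "bridge" and anaphor_type != antecedent_type:
        problems.append(
            f"{relation} link joins {anaphor_type} and {antecedent_type}; types must match"
        )
    return problems


print(check_link("coref", "place", "organization"))  # Steve's Bar tagged inconsistently
print(check_link("bridge", "person", "object"))      # [a car] <-bridge- [the driver]
```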


  • For the 'quantity' type, OntoNotes guidelines only specify “Measurements, as of weight or distance”, and this is assigned also to modifiers. GUM does not annotate modifiers in this way (e.g. [8-foot wall], but not [[8-foot] wall]).
  • As in OntoNotes in practice, we accept other types of measurements, e.g. of an Internet connection speed: [8 Mbps]quantity
  • Amounts of currency are annotated as quantity, since their exact realization may vary (e.g. physical coins or abstract electronic data in the bank). Only specific reference to coins or bills is annotated as an object:
    • It cost [$100]quantity
    • He gave me [a 20 dollar bill]object
  • Note that numbers standing for a known, different entity type are taken to be that entity type, e.g. [hundreds]person were killed (person and not quantity).


  • Names of businesses are generally tagged as organizations, even if they are used to indicate a location:
    • “We bought it at [Macy's]organization”
  • This guideline does not apply to commercial locations used as places, such as: “at [the mall]place”


  • Websites may often be considered places:
    • You can get more details at []place
    • We met on [Facebook]place


  • Indexical items like today, yesterday etc., are taken as (accessible) time terms


  • Titles of or references to authored works are abstract. Titles are coreferent to the work.
    • [The Bible]abstract vs. [this specific Bible]object


The coreference scheme is loosely based on the design principles of the OntoNotes coreference scheme (Weischedel et al. 2012) but with more specific relation types, inspired by the TüBa-D/Z coreference scheme (Telljohann et al. 2012). A major design principle is that coreference should serve to identify the discourse referent referred to by underspecified expressions such as pronouns, and allow us to track the behavior of discourse referents as their expressions evolve over the course of a discourse.

There are five types of coreference links:

  • ana - anaphoric, a pronoun referring back to something: [the woman] ←ana– [she]
  • cata - cataphoric, a pronoun referring forward to something: [it]'s impossible [to know] ([it]–cata→[to know])
  • appos - apposition, same as in syntax: [Your neighbor],←appos– [the lawyer] came by earlier.
  • bridge - bridging, in some inferrable part-whole relationship, requires no introduction thanks to the antecedent: [a car] ←bridge– [the driver]
  • coref - other types of coreference, typically lexical mention: [Obama] …. ←coref– [President Obama]
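The five labels and the direction of their arrows can be captured in a short Python sketch. The `Link` class and the use of token start offsets to compare positions are simplifying assumptions; they only illustrate that cata links point forward in the text while the other four point back to an earlier markable.

```python
from dataclasses import dataclass

LINK_TYPES = ("ana", "cata", "appos", "bridge", "coref")


@dataclass(frozen=True)
class Link:
    relation: str
    source_start: int  # token offset of the markable the link starts from
    target_start: int  # token offset of the markable the link points to


def direction_ok(link: Link) -> bool:
    """Check that a link points the way its relation type requires."""
    if link.relation not in LINK_TYPES:
        raise ValueError(f"unknown relation: {link.relation}")
    if link.relation == "cata":
        return link.target_start > link.source_start  # forward in the text
    return link.target_start < link.source_start      # back to an antecedent


# [the woman] ... [she]: 'she' at token 5 points back to token 0
print(direction_ok(Link("ana", source_start=5, target_start=0)))
# [it]'s impossible [to know]: 'it' at token 0 points forward to token 3
print(direction_ok(Link("cata", source_start=0, target_start=3)))
```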

Specific guidelines


  • Ages specified after a person's name are considered appositional, following the OntoNotes guidelines. The idea is that a phrase like “[Mr. Smith], [43]” is something like:
    • “[Mr. Smith]person, [(a) 43 (year old)]person”.
  • The entity type is therefore also person for both markables in this case, and the coreference type is appos.
  • In cases where two full NP realizations of the entity are separated by a coordination such as 'and/or', the normal coref type is used, even if there is a subsequent apposition:
    • [My friend]←coref- and also [my hero]←appos-, [Mrs. Smith].
  • If two mentions share an article, they are no longer separate NPs, and they become one markable according to markable recognition rules:
    • [My friend and hero] ←appos- [Mrs. Smith]


Bridging coreference occurs when two entities do not corefer exactly, but the basis for the identifiability of one referent is the previous mention of one or more previous referents. This can be because the second referent forms part of the whole described by the antecedent, or because multiple referents are aggregated into a larger referring expression (see examples below).

  • If the second referent designating a part or other predictable component of the first referent contains an explicit possessive, the possessive itself should be linked to the first phrase, and no bridging relation needs to be added (since the possessive coreference is explicit).
  • In the case of inferrable parts, the new referent is viewed as ‘accessible’ (by way of bridging).
  • Aggregate referents, i.e. group referents (Mary, Jake → they) are viewed as ‘given’ at the anaphor, but the relationship is 'bridge' (see image below).
  • Examples:
    • It was [a beautiful statue]. [The head] was made of marble.
    • [Endeavour] and [Atlantis] await a journey on [their] respective launchpads.

Not tagged as bridging
  • part with explicit possessive:
    • [The woman] raised [[her] hands]. (‘her’ is anaphoric to ‘the woman’, ‘her hands’ is not linked as bridging, though it can still be a referent)
  • Constructions with 'other', 'similar', 'different' or similar relational adjectives:
    • [one kind of wine] … [another kind of wine] (note that if 'wine' is mentioned previously, both kinds of wine should bridge to that mention separately, but the two distinct 'one wine' / 'another wine' should not bridge to each other, since they are not in a part-whole relationship, unlike 'all wines' ←bridge- 'this specific wine')
  • generic 'you' and 'us': cases of generic 'you' are not considered subsets of the aggregation in generic 'us'. Example: “[You] can only buy the tickets in person. This is annoying for all of [us], but it's the only way” (no bridging from 'us' to 'you')


  • Multiple mentions of ‘I’, ‘me’, ‘you’, ‘your’, ‘mine’ etc. are linked via the ana relationship, just like 3rd person pronouns: [I] ←ana– [me]. In a conversation, one person's ‘I’ may corefer with a ‘you’ used by another interlocutor.
  • Instances of ‘I/me’ that are coreferent with a ‘you’ coming from the other speaker in the dialog are considered linked, via the ‘ana’ relationship like other pronoun chains: He went with [you]? ←ana– [I] went alone.
  • Pronominal 'one' is linked as ana in both generic uses ([one] usually likes [one's] house) and substitutive uses if strictly coreferent ([which one] did you get? [This one].). In a partitive context, bridge should be used (I have [beer]. Give me [one]; note that 'one' is a subset of the beer).
  • Semantically bleached nouns such as 'this fact' or 'that thing' do not constitute instances of the ana link - if the head is a noun, as in these cases, the coref label is used.
  • The indexical adverbs ‘here’ and ‘there’, when they have an explicit antecedent (e.g. 'your new place') qualify as pronouns for the purpose of coreference type, since their interpretation depends completely on the antecedent. The relation is therefore labeled ana.
  • Reciprocal phrases such as 'each other' are regarded as anaphoric. They are linked with either one ana relation, if an aggregate plural mention already exists, or with two bridge relations, if the components of 'each other' have only been mentioned separately so far. The information status is giv in both cases.
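The linking rule for 'each other' can be sketched as follows. The function name and the use of plain strings as markable identifiers are hypothetical; the sketch covers only the two cases named in the guideline (an existing aggregate mention, or exactly two separately mentioned components).

```python
from typing import List, Optional, Sequence, Tuple


def reciprocal_links(reciprocal: str,
                     aggregate: Optional[str],
                     components: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Return (source, target, relation) links for an 'each other' markable."""
    if aggregate is not None:
        # An aggregate plural mention already exists: one ana link.
        return [(reciprocal, aggregate, "ana")]
    if len(components) != 2:
        raise ValueError("this sketch only covers two separately mentioned components")
    # Components were only mentioned separately: one bridge link to each.
    return [(reciprocal, component, "bridge") for component in components]


print(reciprocal_links("each other", "they", []))
print(reciprocal_links("each other", None, ["Mary", "Jake"]))
```

In both cases the information status of the reciprocal markable itself would be giv.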


Cataphora are pronominal or otherwise underspecified nominal elements (including e.g. ‘thing’, ‘fact’) that precede an occurrence of a non-pronominal element that occurs within the same utterance and resolves their discourse referent.

Cataphora may be annotated in copula sentences as linking to their predicate, if the reference of the pronoun is otherwise unresolvable (see example below). Unlike other relations, cataphora point forwards, from the pronoun to the expression that resolves them. Examples (the link always points forward):

  • [It]’s important [to brush your teeth]
  • In [her] address, [the chairwoman] said…

Subject in copula sentence:

  • [it]’s [a long game]
  • [This] was [a very difficult decision]
  • [One of them] is [the channels]
  • [The fact that]
  • [The things like]

If cataphora refer to an entity that has been mentioned before in discourse but not in the current utterance, they are still linked forwards as cataphora (to identify the construction in that sentence as cataphoric). In such cases, the link to mentions before the current sentence departs from the non-pronominal (later) element, back to previous elements. This is done to ensure a search for indirect reference chains is not interrupted by the change in direction thanks to the cataphor. See the image below for an example (coref is outgoing from ‘the neighborhood’ to the previous mention, ‘its’ points forward to ‘the neighborhood’):

The cataphor in this sentence is re-introducing a previously mentioned entity (the neighborhood was discussed earlier in the text).

Information status for cataphors follows the value of their coreferent (the following mention). Thus both a cataphor and its subsequent mention may be considered new (or given, if mentioned before the cataphor as well).


The coref type is used for all other types of lexical coreference. Some specific tricky cases that ARE included are:

  • Distributive 'each' phrases, when the 'each' phrase ultimately covers the entire set of referents in another mention. For example:
    • We invited [two groups], ←coref- [each group] paid separately. (although 'each group' is singular, ultimately the predicate 'paid' applies to both groups, and the phrase 'each group' can be considered to cover the same set denotation as 'two groups')
gum/entities.txt · Last modified: 2019/12/05 15:47 by lg876