Entity and Information Status Annotation

Delimiting markables

Markables for annotation include:

  • Any referential NPs (cf. Dipper et al. 2007), including pronouns, e.g. [the farm], [it].
  • Non-referring NPs are not annotated. These include idiomatic NPs such as 'on the other hand' (here 'the other hand' is not annotated) and non-referential pronouns, such as 'it' in 'it rained'.
  • Verbal markables are annotated only if they are referred back to, e.g. “Kim [visited Seoul]. [The visit] went well”.
  • Cataphoric expletives are annotated, as are the clauses they represent, such as “[it] is clear [Kim was there]”.

Some more specific constructions and guidelines follow:

Copula Predicates (A is B)

  • Unlike OntoNotes (Weischedel et al. 2012), copula predicates are markables, whether or not they are coreferred to separately from the subject: [John] is [a teacher]. Negative predication precludes coreference, but markables should still be annotated for negated copula predicates:
    • [John]1 is [a teacher]1, and [a teacher]2 must always… (note second “a teacher” refers to all teachers, not just John)
    • [John] is not [a teacher] (no coreference)
  • The same reasoning is applied to cases such as [A] is considered [a kind of B] - both markables are created and they co-refer.
  • Similarly, “[A] is like [B]” or “[A] is similar to [B]” (two markables, no coreference)
  • Copula coreference is not annotated for modal predication, since the copular identity is not complete and this can lead to contradictions:
    • [This] can often be [an advantage] but can sometimes also be [a disadvantage] (no coreference)
  • Cases with “[A] as [B]” may be annotated as coreferent if they imply identity of referents, similar to copula predication. For example:
    • YES coreferent: [John] worked as [a baker] (the baker is then literally the same as John, compare [John] is [a baker])
    • NOT coreferent: [John] is the same as [my brother] (John is not my brother; this is another form of the 'similar to' case above)

What to include in the markable

  • In most cases, the span of the markable will be the entire NP including modifiers, such as prepositional phrases that belong to the NP, possessors or genitives. Thus in “[the boy with the blue coat] saw [her]” the spans of the two markables are maximal, as delineated by the brackets.
  • Covering all modifiers includes clausal expansions, such as relative and infinitival clauses, as long as they expand the head noun. The following examples show clauses that are included inside the markable:
    • Relative clauses (acl:relcl): I saw [the boy who came to class]
    • Participial post-modifiers (acl): [The city of York, founded by the Romans] (not [The city of York], founded by the Romans)
    • Infinitival post-modifier (acl): [A chance to win]
  • Relatives expanding a controlling verb are not included in a noun's markable:
    • We helped [Joey], which really saved the day ('helping' saved the day, not 'Joey')
  • In some cases, adverbs such as “here” or “there” when referring to a place that has already been mentioned (or similarly “then” for a time) will also need to be made into markables, e.g. “…went to [London]. [There] …”. Such adverbs are not annotated if they are not referred back to using a noun phrase or pronoun.
  • Occasionally, an entire sentence or clause will be referred to by a second referring expression. In this case the entire sentence being referred to will be made into a markable, usually an event: “[the rain flooded the village.] ← [it] was terrible”.
  • When such reference does not occur, sentences are not made into markables. Note that if the entire sentence is inside the markable, then sentence final punctuation is also included in the markable (here the final period).
  • It is possible that a sentence antecedent of a pronoun has been mentioned even before its most recent occurrence. In this case, previous occurrences will also be marked. For example, the clause “push the button for every floor” is considered given because the VP was mentioned in a previous title. In this case, all three markables should be linked together:
    1. [Push all the buttons]new
    2. When you get into the elevator, [push the button for every floor]giv
    3. [This]giv makes everyone's ride on the elevator longer, if just for a few seconds
  • Titles and epithets: Words like Mr., but also roles like President, are part of a markable and do not constitute an apposition. As the third example below shows, plural titles can refer to an entire coordinate markable, each constituent of which is a markable without the title (since a plural title belongs only to the coordination).
    • [Mr. Smith]
    • [President Carter]
    • [Rappers [MC Hammer]person|new and [2Pac]person|new]person|new

Coordination (A and B)

  • Coordinate phrases generally receive separate markables for each component, e.g. “[restaurants] and [hotels]”. If both are referred to together, an additional markable is added: … [ [restaurants] and [hotels] ]. [They] are always expensive.
  • In cases where two (or more) nouns are not full NPs, i.e. when they share an article, we only assign one markable by default:
    • She is [my wife and best friend] (one markable, since both are determined by 'my')
  • The two submarkables are also annotated only in the rare event that both nouns within such a markable are separately referred to later, or if the two submarkables would have different entity types. For example:
    • We saw [the [car] and [driver]] . [The car] was black and [the driver's] uniform matched [its] color.
  • If there is an aggregate mention of a mixed-type markable, the entity type is 'abstract', e.g.:
    • We saw [ [a sheep]animal and [a bottle]object]abstract. [They]abstract were both white.
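
The typing rule for aggregate markables can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and keeping the shared type when all components agree is an assumption (suggested by the 'Rappers' example above, but not stated as a rule in this section).

```python
def aggregate_entity_type(component_types):
    """Return the entity type for a markable spanning a whole coordination."""
    unique = set(component_types)
    if not unique:
        raise ValueError("a coordination needs at least one component")
    # Mixed types collapse to 'abstract', as in the sheep/bottle example.
    # Assumption: a uniform coordination keeps the shared type.
    return unique.pop() if len(unique) == 1 else "abstract"


print(aggregate_entity_type(["animal", "object"]))  # abstract
print(aggregate_entity_type(["person", "person"]))  # person
```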

Interrupted and repaired markables

If a repair results in two separate NPs (even if incomplete), both are annotated, and can be coreferent in context. This can be identified by presence of either separate articles or head nouns. Compare:

  • [The whol-]object, [the whole thing]object (two articles, so two markables)
  • [some brown dog-]animal, uh [brown dogs]animal (two head nouns, so two markables)
  • [a brow- uh black dog]animal (single article and head noun, so one markable)

Other Specific Cases

State Names

  • States are generally available as individual referents, but City+State also form a markable. Note that the city markable is therefore longer. If Ohio is referred to later on in the text, it is coreferent with the smaller markable.
    • [Cleveland, [Ohio]]
  • Abbreviations such as [OH] for Ohio may also be accepted as markables.
  • Entity names within complex tokens are not annotated, e.g. if we have a token 'church-related', we cannot annotate just the subpart 'church' as a markable, nor will we annotate the entire 'church-related' as a markable. These cases are left out of the annotation.

Naming Constructions

In cases where a name for something is given, that name is coreferent with the thing being named, unless the name is being discussed as such, in which case it may form an abstract entity. The rationale for this is that subsequent reference to the name as a concept supersedes its reference to the thing named.

  • [John]person called [himself]person “ [The Terminator]abstract ” . [This name]abstract … (note “this name” corefers to “the Terminator”, but not “John”, and is abstract)
  • Simple 'calling' constructions are interpreted as plain, nested coreference: “[A Boy named [Sue]person]person” (with coreference)
  • Subsequent references to a name can corefer to the name markable if discussion of the name is intended, and not the thing being named. In such cases, the first mention of the name bridges to the most recent mention of the thing being named.

Dates

  • Full dates receive 2-3 markables: one for the month, one for the year, and one for the whole date (since the day is the head), but only if this does not create two identical spans:
    • [ 12 [March [2012]] ] (all three markables are 'time')
    • [ March 12 [2012] ] (a separate markable corresponding to the month is not possible, since 'March .. 2012' covers the exact same span of tokens as the entire date)
  • Months whose specific year can be resolved by referring back to a year mentioned earlier in the text are seen as bridging: 2007 …. ←bridge- October
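
The "no two identical spans" condition for dates can be made concrete with a short sketch. It assumes pre-tokenized input containing an English month name and a four-digit year; the function name and the inclusive (start, end) token-index convention are illustrative choices, not part of the guidelines.

```python
MONTHS = {
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
}


def date_markables(tokens):
    """Return inclusive (start, end) token spans for one full date."""
    month_idx = next(i for i, tok in enumerate(tokens) if tok in MONTHS)
    # Assumption: the year is the only four-digit token in the date.
    year_idx = next(
        i for i, tok in enumerate(tokens) if tok.isdigit() and len(tok) == 4
    )
    whole = (0, len(tokens) - 1)
    # The month markable stretches from the month token to the year token.
    month = (min(month_idx, year_idx), max(month_idx, year_idx))
    year = (year_idx, year_idx)
    spans = []
    for span in (whole, month, year):
        if span not in spans:  # never create two identical spans
            spans.append(span)
    return spans


print(date_markables(["12", "March", "2012"]))  # [(0, 2), (1, 2), (2, 2)]
print(date_markables(["March", "12", "2012"]))  # [(0, 2), (2, 2)]
```

In the second call the month span would cover the same tokens as the whole date, so only two markables remain.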

Citations and references in Wikipedia

Authorial citations with author names are taken to be references to the author(s) and year (similar to “Smith said in 2009”):

  • [Smith]person [2009]time

At the same time, the entire reference is taken to be an abstract entity, so we add a third markable:

  • [[Smith]person [2009]time]abstract

Numerical links are taken to be mentions of the work, and therefore abstract in the same way. Coreference is also marked for each matching citation number:

  • … has been shown in the past ( [17]abstract ) Other studies disagreed … (see ←coref– [17]abstract )
  • In citation numbers in square brackets, only the number is part of the entity span, and square brackets are left outside the entity, since for multiple references we can get: [13, 17, 18]. For consistency, even for a single reference “[4]”, only the number token is taken as the entity span.
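
As a rough illustration of the span rule for numerical citations, the sketch below finds bracketed citation groups in raw text, keeps only the numbers as spans, and groups matching numbers into chains. Character offsets stand in for token spans here, and both function names are hypothetical.

```python
import re

BRACKET_GROUP = re.compile(r"\[([0-9,\s]+)\]")
NUMBER = re.compile(r"[0-9]+")


def citation_markables(text):
    """Return (start, end, number) character spans, brackets excluded."""
    found = []
    for group in BRACKET_GROUP.finditer(text):
        offset = group.start(1)  # first character after the opening bracket
        for num in NUMBER.finditer(group.group(1)):
            found.append((offset + num.start(), offset + num.end(), num.group()))
    return found


def citation_chains(text):
    """Group spans by citation number; each group is one coreference chain."""
    chains = {}
    for start, end, number in citation_markables(text):
        chains.setdefault(number, []).append((start, end))
    return chains


sample = "This has been shown [17]. Other studies disagreed [13, 17, 18]."
for number, spans in citation_chains(sample).items():
    print(number, spans)
```

Every span covers the digits only, so a single reference and a multiple reference are treated the same way.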

Indefinite pronouns referring to verbs

Indefinite pronouns such as 'something' are only annotated if they refer to nominals. Cases referring to verbs can be identified for example by coordination with verbs:

  • It's messed up or something. (no annotation of 'something', which is coordinated with 'messed up', a verbal phrase)

Specific non-referential NPs

The following examples are not considered referential NPs:

  • “Every time (that)…” (i.e. 'every time' meaning 'when' or 'always')
  • “all the time”
  • “on a daily basis”
  • “(in) line (with)” (not an instance of a 'line')
  • “takes time”

Information status

  • Information status has the following values:
    • auto - same as giv/new below, assigned automatically based on position in coreference chain
    • new - not mentioned before, first mention ('auto' may be used instead)
    • giv - mentioned before, must be linked to previous mention ('auto' may be used instead)
    • acc - accessible - not mentioned before, but immediately available, requiring no introduction. This includes:
      • generics - [the sun]acc, [the world]acc
      • indexical expressions - [I]acc, [you]acc, [here]acc, [this]acc (when pointing to something)
      • absolute time expressions - [2016]acc, [October]acc, [this Friday]acc
      • countries taken as requiring no introduction - in [the US]acc today …
      • bridging - The company … [The CEO]acc (in this case, a bridging link must be made, see Coreference below)
    • split - indicates that a referent is given via previous mention of multiple non-adjacent parts, e.g. John … Mary … [they]split
  • Do not overuse the accessible generic category: not every definite NP is accessible if it is the first mention in the chain. Some examples that are not considered accessible:
    • [an Officer in [the [United States]acc Air Force]new]new – The US is considered accessible by the major country guideline, and the officer is new (and indefinite). Although the speaker may assume that we know what 'US Air Force' is, it needed to be introduced into the discourse model, much like the introduction of a proper name that we know.
    • In the same way, the first mention of [Barack Obama]new is tagged as new, even though his identity is available to us. The idea is that newness refers to the introduction into the discourse model.
  • Personal pronouns that are inferable in the situation (I/me, you etc.) are accessible the first time they are mentioned. They are subsequently tagged as ‘giv’, since they have already been referred to explicitly.
  • Information status for cataphors: see Coreference below.
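
The 'auto' value can be pictured as a placeholder that is resolved from the position of a mention in its coreference chain. The following is a minimal sketch under that reading, not the project's actual tooling; the function name is hypothetical, and cataphors (whose status follows the later mention) are not handled.

```python
def resolve_auto_status(chain):
    """Resolve 'auto' labels for one chain, given in document order."""
    resolved = []
    for position, status in enumerate(chain):
        if status == "auto":
            # First mention is new, every later mention is given.
            status = "new" if position == 0 else "giv"
        resolved.append(status)  # manual labels such as 'acc' are kept
    return resolved


print(resolve_auto_status(["auto", "auto", "auto"]))  # ['new', 'giv', 'giv']
print(resolve_auto_status(["acc", "auto"]))           # ['acc', 'giv']
```

The second call mirrors the pronoun guideline above: a situationally inferable 'I' is accessible at first mention and given afterwards.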

Entity Type

There are 10 entity types:

  • person - any person, including fictitious figures, groups of people, and semi-human entities (Pinocchio)
  • place - a country (Iceland), region (Sahara), or other place being referred to as a location (the factory - when used as a place, not to refer to the physical building)
  • organization - a company, government, sports team and others
  • object - a concrete tangible object
  • event - includes reference to nouns ('War', 'the performance') and clauses that are referred back to ('that John came')
  • time - dates, times of day, days, years…
  • substance - water, mercury, gas, poison … includes context-dependent substances, such as Skittles or baking chocolate
  • animal - any animal, potentially including bacteria, aliens and others construed as animals
  • plant - interpreted broadly to include fruits, seeds and other living plant parts, but not substances (e.g. 'wood' is not classified as a plant)
  • abstract - abstract notions (luck), emotions (excitement) or intangible properties (predisposition)

Special notes on entity types

With the exception of bridge relations, two coreferring markables must have the same entity type. This can be tricky when two markables seem to fall into two different categories. For instance, the owners of Steve's Bar may describe it as both their [business]organization and [a dive bar that locals frequent]place. But Steve's Bar is ultimately an organization, and so all markables will have organization entities.
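
A consistency check for this constraint might look like the sketch below. The data layout (a dict of markable ids and a list of link tuples) and the function name are assumptions made for illustration; only the ten type labels and the bridge exception come from the guidelines.

```python
ENTITY_TYPES = {
    "person", "place", "organization", "object", "event",
    "time", "substance", "animal", "plant", "abstract",
}


def check_entity_types(markables, links):
    """Return a list of human-readable problems.

    markables: dict mapping a markable id to its entity type
    links: iterable of (anaphor_id, antecedent_id, relation) tuples
    """
    problems = []
    for markable_id, entity_type in markables.items():
        if entity_type not in ENTITY_TYPES:
            problems.append(f"{markable_id}: unknown entity type {entity_type!r}")
    for anaphor, antecedent, relation in links:
        if relation == "bridge":
            continue  # bridging may connect different entity types
        if markables[anaphor] != markables[antecedent]:
            problems.append(
                f"{anaphor} ({markables[anaphor]}) and {antecedent} "
                f"({markables[antecedent]}) corefer but differ in type"
            )
    return problems


markables = {"bar": "organization", "business": "organization",
             "dive_bar": "place", "owner": "person"}
links = [("business", "bar", "coref"),
         ("dive_bar", "bar", "coref"),
         ("owner", "bar", "bridge")]
for problem in check_entity_types(markables, links):
    print(problem)
```

Here only the 'dive_bar' link is reported, which corresponds to the Steve's Bar case: the markable should be retyped as organization.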

organization

  • Names of businesses are generally tagged as organizations, even if they are used to indicate a location:
    • “We bought it at [Macy's]organization”
  • This guideline does not apply to commercial locations used as places, such as: “at [the mall]place”

place

  • Websites may often be considered places:
    • You can get more details at [mywebsite.com]place
    • We met on [Facebook]place

time

  • Indexical items like today, yesterday etc., are taken as (accessible) time terms

abstract

  • Titles of or references to authored works are abstract. Titles are coreferent to the work.
    • [The Bible]abstract vs. [this specific Bible]object

Coreference

The coreference scheme is loosely based on the design principles of the OntoNotes coreference scheme (Weischedel et al. 2012) but with more specific relation types, inspired by the TüBa-D/Z coreference scheme (Telljohann et al. 2012). A major design principle is that coreference should serve to identify the discourse referent referred to by underspecified expressions such as pronouns, and allow us to track the behavior of discourse referents as their expressions evolve over the course of a discourse.

There are two major types of coreference links: coreference proper, and bridging anaphora. Coreference contains four different subtypes of cases which are automatically derived from the 'coref' type, and bridging covers at least three types of cases:

  • coreference
    • ana - anaphoric, a pronoun referring back to something: [the woman] ←ana– [she]. This is automatically generated from the 'coref' type when the anaphor is a pronoun.
    • cata - cataphoric, a pronoun referring forward to something: [it]'s impossible [to know] ([it]–cata→[to know]). Automatically generated when the first member of a chain is a non-accessible pronoun.
    • appos - apposition, same as in syntax: [Your neighbor],←appos– [the lawyer] came by earlier. Generated automatically from coref using the syntax trees.
    • lexical coref - all types of coreference, including lexical mention: [Obama] …. ←coref– [President Obama]

  • bridge
    • bridging proper - some inferrable part-whole relationship, which requires no introduction for the anaphor thanks to the antecedent: [a car] ←bridge– [the driver]
    • non co-referential anaphora - cases in which the bridged anaphor is not part of the antecedent, but is underspecified and can only be interpreted thanks to mention of the antecedent: [a Chinese restaurant] ←bridge– [an Italian one]
    • split antecedent: [John] met [Mary] ←bridge– [They] took a cab together (in these cases the anaphor has multiple antecedents, but coreference only applies between the last mention and all previous mentions)
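
The automatic derivation of coreference subtypes from plain 'coref' links could be approximated as follows. This is a sketch only: the Mention class, the appos_of_previous flag (standing in for information read off a syntax tree) and the order in which the conditions are tested are all assumptions rather than the documented procedure.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    is_pronoun: bool = False
    infstat: str = "new"
    appos_of_previous: bool = False  # hypothetical flag taken from a parse


def coref_subtypes(chain):
    """Return one subtype per link between adjacent mentions of a chain."""
    subtypes = []
    for index in range(1, len(chain)):
        earlier, later = chain[index - 1], chain[index]
        if index == 1 and earlier.is_pronoun and earlier.infstat != "acc":
            # Chain opens with a non-accessible pronoun: forward-pointing link.
            subtypes.append("cata")
        elif later.appos_of_previous:
            subtypes.append("appos")
        elif later.is_pronoun:
            subtypes.append("ana")
        else:
            subtypes.append("coref")
    return subtypes


print(coref_subtypes([Mention("the woman"), Mention("she", is_pronoun=True)]))
print(coref_subtypes([Mention("it", is_pronoun=True), Mention("to know")]))
print(coref_subtypes([Mention("Obama"), Mention("President Obama")]))
```

The three calls correspond to the ana, cata and lexical coref examples in the list above.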

Specific guidelines

Apposition

  • Ages specified after a person's name are considered appositional, following the OntoNotes guidelines. The idea is that a phrase like “[Mr. Smith], [43]” is something like:
    • “[Mr. Smith]person, [(a) 43 (year old)]person”.
  • The entity type is therefore also person for both markables in this case.
  • Note that other mentions of ages are abstract, including in “[I] was [16]abstract”
  • In cases where two full NP realizations of the entity are separated by a coordination such as 'and/or', the normal coref type is used, even if there is a subsequent apposition:
    • [My friend]←coref- and also [my hero]←coref-, [Mrs. Smith].
  • If two mentions share an article, they are no longer separate NPs, and they become one markable according to markable recognition rules:
    • [My friend and hero] ←coref- [Mrs. Smith]

Bridging

Bridging occurs when two entities do not corefer exactly, but the basis for the identifiability of one referent is the previous mention of one or more previous referents. This can be because the second referent forms part of the whole described by the antecedent, or because multiple referents are aggregated into a larger referring expression (see examples below).

  • If the second referent designating a part or other predictable component of the first referent contains an explicit possessive, the possessive itself should be linked to the first phrase, and no bridging relation needs to be added (since the possessive coreference is explicit).
  • In the case of inferrable parts, the new referent is viewed as ‘accessible’ (by way of bridging).
  • Aggregate referents, i.e. group referents (Mary, Jake → they) are viewed as ‘split’ at the anaphor, and the relationship is 'bridge'.
  • Examples:
    • It was [a beautiful statue]. [The head]object|acc was made of marble.
    • [Endeavour] and [Atlantis] await a journey on [their]object|split respective launchpads.
Not tagged as bridging
  • part with explicit possessive:
    • [The woman] raised [[her] hands]. (‘her’ is anaphoric to ‘the woman’, ‘her hands’ is not linked as bridging, though it can still be a referent)
  • Constructions with 'other', 'similar', 'different' or similar relational adjectives with an explicit noun:
    • [one kind of wine] … [another kind of wine] (note that if previously 'wine' is mentioned, both kinds of wine should bridge to that separately, but the two distinct 'one wine' / 'another wine' should not bridge to each other since they are not in a part-whole relationship, unlike 'all wines' ←bridge- 'this specific wine')
    • But note that without the noun (e.g. “one” anaphora), we do annotate bridging: [A great restaurant] … [a different one]
  • generic 'you' and 'us': cases of generic 'you' are not considered subsets of the aggregation in generic 'us', though each of these can have a group (you… you). Example: “If [you] want to [you] can only buy the tickets in person. This is annoying for all of [us], but it's the only way” (no bridging or coref from 'us' to 'you', but coref from [you] to [you])

Anaphora

  • Multiple mentions of ‘I’, ‘me’, ‘you’, ‘your’, ‘mine’ etc. are linked via the coref relationship (subtype ana), just like 3rd person pronouns: [I] ←coref– [me]. In a conversation, one person's ‘I’ may corefer with a ‘you’ used by another interlocutor.
  • Instances of ‘I/me’ that are coreferent with a ‘you’ coming from the other speaker in the dialog are considered linked, via the ‘coref’ relationship like other pronoun chains: He went with [you]? ←coref– [I] went alone.
  • Pronominal 'one' is linked as ana in both generic uses ([one] usually likes [one's] house) and substitutive uses if strictly coreferent ([which one] did you get? [This one].). In a partitive context, bridge should be used (I have [beer]. Give me [one]; note that 'one' is a subset of the beer).
  • The indexical adverbs ‘here’ and ‘there’, when they have an explicit antecedent (e.g. 'your new place') qualify as pronouns for the purpose of coreference type, since their interpretation depends completely on the antecedent. The relation is therefore labeled coref (ana).
  • Reciprocal phrases such as 'each other' are regarded as anaphoric. They are linked either with one coref relation, if an aggregate plural mention already exists, or with bridge relations, if the components of 'each other' have only been mentioned separately so far. The information status is split in the latter case, otherwise giv.

Cataphora

Cataphora are pronominal or otherwise underspecified elements (including e.g. ‘those’) that precede an occurrence of a non-pronominal element that occurs within the same utterance and resolves their discourse referent.

Cataphora may be annotated in copula sentences as linking to their predicate, if the reference of the pronoun is otherwise unresolvable (see example below). Unlike other relations, cataphora point forwards, from the pronoun to the expression that resolves them. Examples:

  • [It]’s important [to brush your teeth]
  • In [her] address, [the chairwoman] said…

Subject in copula sentence:

  • [it]’s [a long game]
  • [This] was [a very difficult decision]

Information status for cataphors follows the value of their coreferent (the following mention). Thus both a cataphor and its subsequent mention may be considered new.

Coref

The coref type is used for all types of lexical coreference. Some specific tricky cases that ARE included are:

  • Distributive 'each' phrases, when the 'each' phrase ultimately covers the entire set of referents in another mention. For example:
    • We invited [two groups], ←coref- [each group] paid separately. (although 'each group' is singular, ultimately the predicate 'paid' applies to both groups, and the phrase 'each group' can be considered to cover the same set denotation as 'two groups')
gum/entities.txt · Last modified: 2020/11/13 14:42 by amir