Dependency annotation generally follows Stanford Typed Dependencies, using the Basic, non-collapsed dependency inventory (prep → pobj, no crossing edges), as described in de Marneffe & Manning (2013). One frequent later addition adopted in this corpus too is the 'vocative' label, which attaches to the root of the sentence or utterance.
Instructions for some special cases follow below the list of labels.
The copula 'be' appears primarily in three constructions:
In the normal predicative construction 'A is B', the nominal predicate 'B' is the root, and 'A' is the nsubj. The verb 'be' itself is a dependent of the predicate B and takes the label cop.
In the existential construction 'there is A', the verb 'be' is taken to mean 'exist', and is labeled as the root. The subject is A (nsubj) and expletive 'there' is labeled expl.
When the predicate is a prepositional phrase, the convention is to analyze 'be' as the root, taking the predicate's preposition as a prep, which in turn has a normal pobj. The motivation is to not have to interpret whether 'be' is used existentially with a locative, or in some other sense more like the 'A is B' construction.
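As a minimal sketch, the three copula analyses above can be written out as (head, label, dependent) triples. The example sentences, the "ROOT" pseudo-head, and the helper names `root_of` and `labels_under` are assumptions made for illustration and are not taken from the corpus.

```python
# Each edge is (head, label, dependent). "ROOT" is a hypothetical pseudo-head.

predicative = [  # "Kim is a teacher" (illustrative sentence)
    ("ROOT", "root", "teacher"),
    ("teacher", "nsubj", "Kim"),
    ("teacher", "cop", "is"),
    ("teacher", "det", "a"),
]

existential = [  # "There is a problem" (illustrative sentence)
    ("ROOT", "root", "is"),
    ("is", "expl", "There"),
    ("is", "nsubj", "problem"),
    ("problem", "det", "a"),
]

prepositional = [  # "Kim is in Paris" (illustrative sentence)
    ("ROOT", "root", "is"),
    ("is", "nsubj", "Kim"),
    ("is", "prep", "in"),
    ("in", "pobj", "Paris"),
]


def root_of(edges):
    """Return the token attached to the pseudo-head ROOT."""
    return next(dep for head, label, dep in edges if label == "root")


def labels_under(edges, head):
    """Return the sorted labels of all direct dependents of `head`."""
    return sorted(label for h, label, dep in edges if h == head)


for name, edges in [("predicative", predicative),
                    ("existential", existential),
                    ("prepositional", prepositional)]:
    print(name, "root:", root_of(edges))
```

Only the predicative construction uses cop; in the other two, 'be' itself is the root.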
Dates with multiple coreferent parts are handled as appositions (appos). For example, “Sunday, the 13th”, constitutes two mentions of the same day. By 'rule of first dibs', the apposition goes from 'Sunday' to '13th'. Months with dates are treated as nn, i.e. 'October 15' is a type of '15' (note that October 15 is an instance of 'day', not 'month'). Years added to dates are seen as temporal modifiers of the day expression, and are labeled as tmod.
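The date conventions can be sketched as two small helpers that return (head, label, dependent) triples. The function names and the year '2013' are illustrative assumptions; punctuation and determiners are left out for brevity.

```python
def month_day_edges(month, day, year=None):
    """Edges for 'October 15' or 'October 15, 2013'.

    The day number is the head: the month is its nn modifier and an
    optional year is its tmod modifier.
    """
    edges = [(day, "nn", month)]
    if year is not None:
        edges.append((day, "tmod", year))
    return edges


def weekday_apposition(weekday, day):
    """Edges for 'Sunday, the 13th': the first mention governs the second."""
    return [(weekday, "appos", day)]


print(month_day_edges("October", "15", "2013"))
print(weekday_apposition("Sunday", "13th"))
```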
Image credits of the type: 'Image: XYZ' are seen as an individual construction and not analyzed as parataxis or nominal predication (root+nsubj). Instead, the convention is to use the dep label to point from the first part ('image') to the head of the second part. This avoids counting these constructions when searching e.g. for subjects or nominal sentences.
The same logic applies to quotation attribution with a speech verb. For example, in:
“To be or not to be” – Hamlet
The root is in the quotation, and 'Hamlet' is attached to that as dep. This is not the guideline if a speech verb is present, i.e. 'said' is the root in:
“To be or not to be”, said Hamlet.
Although the proper noun tag is applied even to (capitalized) adjectives in complex names, syntactic analysis should still treat them as adjectives etc. The rationale is that the POS tag can help find names, while a function label such as amod allows us to identify the internal structure of the name in question.
For complex personal names, the last name is the head, and everything else is nn to that:
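A minimal sketch of the personal-name convention, assuming the name has already been split into tokens. The helper `personal_name_edges` and the example name are hypothetical and chosen only for illustration.

```python
def personal_name_edges(tokens):
    """Return (head, label, dependent) triples for a personal name.

    The last token is the head; every earlier token attaches to it as nn.
    """
    if not tokens:
        raise ValueError("a name needs at least one token")
    head = tokens[-1]
    return [(head, "nn", token) for token in tokens[:-1]]


print(personal_name_edges(["John", "Fitzgerald", "Kennedy"]))
```

A one-token name yields no edges, since the single token is the head by itself.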
The apparent 'double object' construction with 'make' and similar verbs is given a small clause type of analysis, in which the object of the verb 'make' is seen as the subject of an embedded predication. In other words, 'make A a B' is analyzed as making it so that A is a B. As a result, the analysis uses the xcomp label emanating from 'make' to signify that the accusative object of 'make' is the same as the subject of the clausal predicate, but the 'thing being made' is internally labeled as the subject of the small clause predication. This can be seen in the image below:
Another way of thinking of this is that the analysis means: make(that woman is the president)
Verbs such as 'let' in “let someone do something” or 'allow' in “allow A to do B” are analyzed as governing an xcomp clause, where the noun following the verb acts as the subject of the subordinate small clause, not as the object of 'let', etc.
Verbs like 'call' or 'name' appear to take a double accusative object, e.g. “John called [Mary] [a saint]”. This makes it hard to distinguish the name argument from the named theme argument. The guidelines instead favor a different analysis using xcomp. The idea is that the naming action creates a small clause with the named as subject and the name as predicate: John performs a naming act, whose content is: “Mary is a saint”.
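The small clause analysis for 'make', 'let', 'call' and similar verbs can be sketched as follows. The helpers `small_clause` and `has_dobj` are hypothetical names; only the edges described in the guideline are listed, so determiners are omitted.

```python
def small_clause(verb, subject, predicate):
    """Return the two edges of the small clause analysis.

    The verb governs the predicate as xcomp, and the noun after the verb
    is the nsubj of that predicate rather than the dobj of the verb.
    """
    return [(verb, "xcomp", predicate), (predicate, "nsubj", subject)]


def has_dobj(edges, verb):
    """Check whether `verb` governs any direct object in `edges`."""
    return any(head == verb and label == "dobj" for head, label, _ in edges)


# "make that woman the president" -> make(that woman is the president)
make_edges = small_clause("make", "woman", "president")

# "John called Mary a saint" -> the naming act has the content "Mary is a saint"
call_edges = [("called", "nsubj", "John")] + small_clause("called", "Mary", "saint")

print(make_edges)
print(call_edges)
print("dobj on 'called':", has_dobj(call_edges, "called"))
```

Under this analysis a search for direct objects of 'called' finds nothing, which is the intended effect of avoiding the double accusative.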
In some cases, a whole phrase can be used in place of a single word, e.g. as a compound modifier. In these cases, the complex modifier should be analyzed internally, and its local root is still attached to the token it modifies with the normal label.
In the example, 'what to buy' is an infinitive + object with an internal analysis, but it functions much like a compound modifier (cf. 'the shopping section'). For this reason, it is attached at its head (the verb) with the function nn.
In adverbial clauses, the subordinating conjunction is labeled as 'mark' by convention for 'if' and 'whether' clauses. However, other conjunctions have an adverbial function within the clause. Much like a direct object 'whom' is not a mark but still dobj inside a relative clause, adverbial conjunctions with temporal or locative meaning, as well as manner adverbials, are labeled advmod inside subordinate clauses. This applies to 'when', 'where' and 'how', paralleling such adverbs as 'then', 'there', and 'thus'.
Adverbial infinitive clauses, such as purpose clauses, which are not an argument of their embedding clause predicate, are advcl, not xcomp (since they are not complements). A common test to distinguish these is whether or not we can insert 'in order to':
See also the guideline for 'in order to' below.
Comparative adjectives that take 'than' dominate the word 'than' as prep. For analytic comparatives, the word 'more' is seen as advmod to the lexical adjective, and 'than' is governed by the lexical adjective as well (e.g. in 'more expensive than…', expensive governs the other two words).
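As a rough sketch of the comparative convention, the helper below builds the edges for a comparative with a nominal standard of comparison. The function name `comparative_edges` and the example words are assumptions; attaching the standard to 'than' as pobj follows the general prep → pobj pattern and is assumed here for nominal standards only.

```python
def comparative_edges(adjective, standard, analytic=False):
    """Return (head, label, dependent) triples for 'ADJ than STANDARD'.

    The lexical adjective governs 'than' as prep. For analytic
    comparatives it also governs 'more' as advmod.
    """
    edges = []
    if analytic:
        edges.append((adjective, "advmod", "more"))
    edges.append((adjective, "prep", "than"))
    edges.append(("than", "pobj", standard))  # assumed: nominal standard
    return edges


print(comparative_edges("cheaper", "gold"))
print(comparative_edges("expensive", "gold", analytic=True))
```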
Sentence initial coordinating conjunctions are attached to the root, pointing backwards, with the cc function.
Raising verbs appear to take a subject that actually belongs to a subordinate predicate semantically. This can be identified by alternations such as “John seems sick” vs “it seems John is sick” or “I happen to own a boat” vs. “It so happens I own a boat”. In both cases, the subject is predicated on in the embedded predicate (e.g. happens(I own a boat), not happen(I), or “I happen”). In these cases, the subject is attached to the subordinate predicate, and the main predicate dominates the subordinate predicate. If the subordinate predicate is an infinitive, it is labeled xcomp, but if it's a full subject-predicate finite clause, it's ccomp. This allows both constructions (with/without 'it') to receive the same analysis with respect to who's the subject, as shown below.
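The point that both variants receive the same subject analysis can be checked with a small sketch. Only the edges the guideline describes are listed (the tokens 'It', 'so', 'to' and 'a' are omitted), and the "ROOT" pseudo-head and the helper `subjects` are illustrative assumptions.

```python
infinitival = [  # "I happen to own a boat"
    ("ROOT", "root", "happen"),
    ("happen", "xcomp", "own"),   # infinitive complement
    ("own", "nsubj", "I"),        # subject attaches to the lower predicate
    ("own", "dobj", "boat"),
]

finite = [  # "It so happens I own a boat"
    ("ROOT", "root", "happens"),
    ("happens", "ccomp", "own"),  # full finite clause complement
    ("own", "nsubj", "I"),
    ("own", "dobj", "boat"),
]


def subjects(edges):
    """Return the set of (predicate, subject) pairs linked by nsubj."""
    return {(head, dep) for head, label, dep in edges if label == "nsubj"}


print(subjects(infinitival))
print(subjects(finite))
```

In both trees the only nsubj edge links 'own' and 'I'; the raising verb itself has no nsubj dependent.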
Footnote markers (the footnote number) should be attached as dep to the root of the constituent that the footnote refers to. If the footnote refers to the entire sentence, then it attaches to the root. If the footnote refers to a smaller constituent, then its root is the source of the dep arrow.
'In order' is seen as a multi-word expression, which may or may not appear with 'to' (cf. 'in order that'). The function of 'in order' is mark and it is attached at the 'in'. The token 'order' is pointed at with mwe as shown below:
The verb of the 'in order to' clause is attached as advcl to the main clause.
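A minimal sketch of the 'in order' analysis as (head, label, dependent) triples. The helper name `in_order_clause` and the example sentence are hypothetical, and it is assumed, as is usual for mark, that the governor of 'in' is the verb of the purpose clause. The token 'to' is left out because its label is not covered here.

```python
def in_order_clause(main_verb, purpose_verb):
    """Return the edges for 'MAIN ... in order (to) PURPOSE'."""
    return [
        (main_verb, "advcl", purpose_verb),  # purpose clause is adverbial
        (purpose_verb, "mark", "in"),        # 'in order' functions as mark
        ("in", "mwe", "order"),              # head-initial multi-word expression
    ]


# "She left in order to catch the train" (illustrative sentence)
print(in_order_clause("left", "catch"))
```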
Subject clauses can be full finite clauses, as in “[that they came] annoyed me”. But the csubj label can also apply to gerund clauses, as in “[doing that] can cause trouble”. In both of these cases, the subordinate clause verb is labeled as csubj to the main clause predicate.
By default, if no other clear syntactic relation applies when an academic reference is supplied, its root (usually a first author name) is attached to the root of the clause containing it as dep and the year is attached to the first author as tmod:
However, if the citation has a distinct syntactic function, the first author is taken as the head and the function is assigned as usual, for example here as the dobj of the verb 'see':
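The two citation cases can be sketched with one helper. The name `citation_edges`, the author 'Smith', the year '2010' and the governor 'shows' are invented for illustration; keeping the year as tmod of the first author in the second case is an assumption carried over from the default case.

```python
def citation_edges(governor, first_author, year, function="dep"):
    """Return (head, label, dependent) triples for an author-year citation.

    By default the first author attaches to the clause root as dep.
    If the citation has its own syntactic function, pass it as `function`
    together with the token that governs it.
    """
    return [
        (governor, function, first_author),
        (first_author, "tmod", year),  # assumed to hold in both cases
    ]


# Default: "... this effect (Smith 2010)", clause root 'shows'
print(citation_edges("shows", "Smith", "2010"))

# Distinct function: "see Smith (2010)"
print(citation_edges("see", "Smith", "2010", function="dobj"))
```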
References consisting only of a number, e.g. “”, function in the same way: the number is the head of the reference, and it is attached as dep to the local root unless it has another normal function (dobj, pobj, etc.)
Multiple adjacent references are considered to be coordinated, whether or not an explicit 'and' appears:
Ranges of references with a hyphen are treated as a prepositional “TO” phrase:
When the direct object of a saying verb is a quote, it is labeled as ccomp whether or not the quote is a full clause.
The exception is when the “X said” appears medially, in which case it is considered a parenthetical, with the verb of saying dependent on the speech's root as parataxis.
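The direction of the arrow flips between the two quotation cases, which the following sketch makes explicit. The helper `quote_edge` and the example sentences are hypothetical.

```python
def quote_edge(speech_verb, quote_root, medial=False):
    """Return the edge linking a verb of saying and the root of a quote.

    Normally the verb governs the quote as ccomp. When 'X said' is
    medial, the quote root governs the verb as parataxis instead.
    """
    if medial:
        return (quote_root, "parataxis", speech_verb)
    return (speech_verb, "ccomp", quote_root)


# "I am tired", said Kim.
print(quote_edge("said", "tired"))

# "I am", Kim said, "tired of this."
print(quote_edge("said", "tired", medial=True))
```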
Verbs of saying can have two objects, direct (dobj) and indirect (iobj). Both are present in
In this case, Mary is the indirect object. It's important that, even if what is said is missing, the person being told is still iobj. For example, the following has iobj only:
For compound nouns generally written as one word, or as two words separated by a hyphen, that you feel have been incorrectly split apart, treat the relation as an nn.
If 'more than' modifies a quantity, then the lexical word is the head, and 'more than' is a modifier which is internally an mwe.
If 'more than' is used to compare things ('a is more than b'), then it is not an mwe, reverting to prep/pobj.
The multi-word expression relation is used for certain multi-word idioms that behave as one function word. MWEs are always annotated head-initially.
The current list of mwes includes 32 expressions:
MWE dependencies should be limited to these specific expressions. If you have a word that seems to have been incorrectly split apart, such as 'with out', use goeswith instead. The head is what you feel is the “main” part of the word. goeswith should only be used as a last resort, when you have exhausted all other possible dependencies.