RST Signaling Guidelines
Lexical Chains
Lexical chains are annotated for words with the same lemma. For example:
Note how 'rate-based' is not considered part of the lexical chain.
In some cases, lexical_chain is annotated for synonyms or other non-identical terms. In such cases, if the similar words can be identified, we annotate lexical_chain as usual, but add 'non_ident' in the notes column. For example:
Reported speech
Annotations not included in our scheme
Choosing source and target
For satellite-nucleus relations, the satellite is the source and the nucleus is the target
For multinucs, the child is the source, and the non-terminal multinuc node is the target
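The two rules above can be written as a small lookup. This is only a sketch: the function name `signal_endpoints`, the relation-kind strings, and the node IDs are hypothetical and not part of the annotation scheme.

```python
from typing import Optional, Tuple


def signal_endpoints(
    kind: str,
    *,
    satellite: Optional[str] = None,
    nucleus: Optional[str] = None,
    child: Optional[str] = None,
    multinuc_node: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Return (source, target) for a signal, following the two rules above.

    The kind strings 'satellite-nucleus' and 'multinuc' are illustrative labels.
    """
    if kind == "satellite-nucleus":
        # The satellite is the source, the nucleus is the target.
        return satellite, nucleus
    if kind == "multinuc":
        # The child is the source, the non-terminal multinuc node is the target.
        return child, multinuc_node
    raise ValueError(f"unknown relation kind: {kind!r}")


# Hypothetical node IDs, for illustration only.
print(signal_endpoints("satellite-nucleus", satellite="n3", nucleus="n4"))
print(signal_endpoints("multinuc", child="n5", multinuc_node="n9"))
```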
Position of the annotation
Normal anchored signals are placed on all signaling words, in one contiguous span if possible and otherwise in multiple contiguous spans
Discontinuous spans receive an automatically incremented co-index in a column 'discontinuous' (e.g. all parts of the first discontinuous item receive co-index 1, all parts of the next discontinuous item receive 2, etc.)
The special co-index 0 is used for non-discontinuous annotations that share a row with a discontinuous annotation (e.g. '3|0', marking a line sharing discontinuous index '3' and a second, non-discontinuous annotation signified by '0').
'0' is also used for all other annotations that would otherwise be empty when a '|' is used to separate multiple annotations. For example, if a note applies to only the first of two annotations, we use 'some_note|0' to indicate that the note belongs to the first annotation and the second has none.
Unanchored signals are placed on the single first token after the position of the annotation
Multiple signals annotated at the same token are separated by a pipe ('|') in ALL cells of that row's signaling annotation, including:
If a multi-token signal (e.g. several words on multiple rows in GitDox) overlaps a smaller signal (e.g. a single word), we split up the larger span into multiple identical annotations, since we can't use the '|' syntax for only part of the span.
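The pipe and '0' conventions above can be illustrated with a small parser that turns one row into one record per annotation. This is a minimal sketch, not the project's actual tooling: the column names `type`, `subtype` and `notes` are assumptions made for illustration ('discontinuous' is the column named above), and treating '0' as "no value" is one possible reading of the rule.

```python
from typing import Dict, List, Optional


def split_cell(cell: str) -> List[Optional[str]]:
    """Split a pipe-separated cell; '0' (or an empty part) means 'no value'."""
    return [None if part in ("0", "") else part for part in cell.split("|")]


def align_row(row: Dict[str, str]) -> List[Dict[str, Optional[str]]]:
    """Turn one row's cells into one dict per annotation.

    Every cell must hold the same number of '|'-separated values,
    since the pipe is applied to ALL cells of the row.
    """
    columns = {name: split_cell(value) for name, value in row.items()}
    counts = {len(parts) for parts in columns.values()}
    if len(counts) != 1:
        raise ValueError("all cells in a row need the same number of '|' parts")
    n = counts.pop()
    return [{name: parts[i] for name, parts in columns.items()} for i in range(n)]


# Hypothetical row: two signals on the same token. The first is part of
# discontinuous item 3 and carries a note; the second has neither.
row = {
    "type": "semantic|semantic",
    "subtype": "lexical_chain|lexical_chain",
    "discontinuous": "3|0",
    "notes": "non_ident|0",
}
for annotation in align_row(row):
    print(annotation)
```

Raising an error on mismatched cell lengths makes rows that forgot a '0' placeholder easy to spot during checking.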
Morphological tense
'Tense' signals cover all aspects of tense, aspect and mood, including periphrastic constructions in English
It is not necessary to automatically annotate every verb in the source and target spans - only occurrences of tenses that matter for the relations being signaled should be annotated.
For all tense/aspect/mood signals, the entire verbal complex should be annotated, creating parity between simple lexical verbs, periphrastic tenses, and passives:
John [went] there (simple past - annotate just the verb)
John [had gone] there (periphrastic tense, annotate auxiliary and lexical verb)
John [was brought] there (passive, annotate auxiliary and lexical verb)
John [was] happy (non-verbal predicate, only the verb should be annotated)
Interpreted explicit signals
Some words are interpreted as explicit anchored signals of e.g. genre-based signaling. Examples:
Signal labels
Labels containing + for combined signals are always alphabetized (e.g. always semantic+syntactic, not syntactic+semantic)
We found a questionable distinction between 'past_participial_clause' and 'nominal_modifier'. The former is used in a non-restrictive vmod clause: [The average of interbank offered rates for dollar deposits in the London market] [based on quotations at five major banks .]
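The alphabetization rule for combined labels (those containing +) can be checked or enforced mechanically. The helper below is a hypothetical sketch; the name `normalize_label` is not part of the guidelines.

```python
def normalize_label(label: str) -> str:
    """Sort the '+'-separated parts of a combined signal label alphabetically."""
    parts = [part.strip() for part in label.split("+")]
    return "+".join(sorted(parts))


print(normalize_label("syntactic+semantic"))  # semantic+syntactic
print(normalize_label("semantic"))            # single labels are unchanged
```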
Correcting annotation errors
If the Signaling Corpus contains a clear annotation error, we do not include that signal, but add a note structured as follows: rem:TYPE:SIGNAL. For example: rem:semantic:lexical_chain
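The rem:TYPE:SIGNAL note format can be built and read back with two small helpers. This is a sketch under the assumption that neither TYPE nor SIGNAL contains a colon; the function names are hypothetical.

```python
from typing import Optional, Tuple


def make_rem_note(signal_type: str, signal: str) -> str:
    """Build a removal note in the form rem:TYPE:SIGNAL."""
    return f"rem:{signal_type}:{signal}"


def parse_rem_note(note: str) -> Optional[Tuple[str, str]]:
    """Return (TYPE, SIGNAL) if the note is a removal note, else None."""
    parts = note.split(":")
    if len(parts) != 3 or parts[0] != "rem":
        return None
    return parts[1], parts[2]


print(make_rem_note("semantic", "lexical_chain"))
print(parse_rem_note("rem:semantic:lexical_chain"))
print(parse_rem_note("non_ident"))  # not a removal note
```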