What you say where -
a discourse heatmap

RST referetial heatmap — A heatmap of back reference likelihood for a pronoun in the blue unit

Does discourse structure constrain where we talk about what? Research on recurring mentions within discourse graphs shows back-reference is sensitive to the reasons why sentences and groups of sentences are uttered. In the image above, we fixate on a target sentence in blue, and predict how likely it is that a pronoun inside it will have an antecedent in some other unit (without peeking at the words!). Redder units are more likely to contain an antecedent.

There are several frameworks for describing discourse graphs - we use Rhetorical Structure Theory (Mann & Thompson 1988), which posits that text are "trees of clauses". Much like sentences form trees of words, discourse trees can lump together multiple clauses into complex discourse units, which often correspond to coherent sections or paragraphs in writing, and different discourse units have different functions. Instead of function like 'subject' and 'object', discourse units have fiunctions such as expressing 'background information', 'evidence' or 'cause'. The example below shows how these relations coalesce to form a coherent discourse.

RST referetial heatmap — Example analysis of a discourse fragment in RST (source: RST website)

The idea that discourse trees constrain what can be referred to where goes back at least to Polanyi's (1988) 'stack of discourse units', determining entities available for pronominalization at each point (e.g. pronouns should) refer back to words inside some maximal complex unit. This was formalized in RST by Veins Theory (Cristea et al. 1998) usings Domain of Referential Accessibility (DRA): sub-parts of a discourse graph commanding a unit with a pronoun are legitimate regions for back reference.

Although subsequent work (Tetreault & Allen 2003) has shown that a categorical constraint forbidding pronouns from referring back to words outside their DRA is wrong, our recent work has shown that, when used quantitatively alongside other predictors, we can build predictive heatmaps like the one above. For more information see the paper linked below!

[full paper]

References

(), """.". . (), . In: . (.) : , . , .
[ link ] [ link ]
[DOI]

@Article{MannThompson1988,
											  title                    = {Rhetorical Structure Theory: Toward a Functional Theory of Text Organization},
											  author                   = {William C. Mann and Sandra A. Thompson},
											  journal                  = {Text},
											  year                     = {1988},
											  number                   = {3},
											  pages                    = {243-281},
											  volume                   = {8}
											}
@Article{Polanyi1988,
  author    = {Livia Polanyi},
  title     = {A Formal Model of the Structure of Discourse},
  journal   = {Journal of Pragmatics},
  year      = {1988},
  volume    = {12},
  number    = {5-6},
  pages     = {601-638}
}											
@InProceedings{CristeaIdeRomary1998,
  author    = {Dan Cristea and Nancy Ide and Laurent Romary},
  title     = {Veins Theory: A Model of Global Discourse Cohesion and Coherence},
  booktitle = {Proceedings of ACL/COLING},
  url = {http://www.aclweb.org/anthology/P98-1044},
  year      = {1998},
  pages     = {281-285},
  address   = {Montreal, Canada}
}
@InProceedings{TetreaultAllen2003,
  author    = {Joel Tetreault and James Allen},
  title     = {An Empirical Evaluation of Pronoun Resolution and Clausal Structure},
  booktitle = {Proceedings of the 2003 International Symposium on Reference Resolution and its Applications to Question Answering and Summarization},
  year      = {2003},
  pages     = {1-8},
  url = {https://www.cs.rochester.edu/~tetreaul/clause.pdf},
  address   = {Venice, Italy}
}

Updates

Our EACL 2024 paper promotes a strict definition of entity salience by presenting GUMsley, a 12-genre challenge dataset for entity salience evaluation and shows how salient entities added to summarization models are beneficial for deriving higher-quality summaries with fewer hallucinated entities

Check our AACL-IJCNLP 2023 paper about incorporating singletons and mention-based features to improve coreference generalization

Our SIGDIAL 2023 paper on English RST parsing errors examines and models some of the factors associated with parsing difficulties

Our LAW-XVII 2023 (co-located with ACL 2023) paper on a Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation presents GENTLE , a new mixed-genre English challenge corpus totaling 17K tokens and consisting of 8 unusual text types for out-of-domain evaluation and openly released as part of the Universal Dependencies 2.12 version available here

Our ACL 2023 Findings paper on Multi-Genre Data and Evaluation for English Abstractive Summarization presents a 12-genre challenge set for English abstractive summarization (the extreme summarization task) following both generall and genre-specific guidelines

More

Get in touch

If you have any questions or feedback please let us know! If you'd like to join us: We accept new PhD and Masters students every year, please contact Amir Zeldes for more information.

corpling@georgetown.edu
(+1) 202-687-6760
1421 37th St. NW
Washington, DC 20057