Search interfaces

The lab maintains two corpus search interfaces, which offer students and the general public access to language data and statistical analysis tools, as well as an online dictionary.

NLP tools

We develop a number of NLP tools that help build corpora automatically or feed into manual correction loops:

  • xrenner - multilingual non-named entity and coreference resolution
  • RFTokenizer - a trainable segmenter for morphologically rich languages
  • Coptic NLP - a complete pipeline for processing Coptic data

Annotation tools

We provide a number of freely available annotation tools:

  • rstWeb - open-source web interface for Rhetorical Structure Theory annotation
  • GitDox - a version-controlled online XML and spreadsheet editor with built-in validation
  • DepEdit - configurable rule-based editing for dependency corpora in the CoNLL-U format

Featured research

RST referential heatmap

What you say where - a discourse heatmap

Does discourse structure constrain where we talk about what? Research on recurring mentions within discourse graphs shows that back-reference is sensitive to the reasons sentences and groups of sentences are uttered.

RNN reads newspaper for discourse signals

A neural network reads the newspaper...

... in search of discourse signals! We now know a lot about what cues people use to identify discourse relations, but can we teach computers to notice the same signals?