Projects

Search interfaces

The lab maintains two corpus search interfaces and an online dictionary, which give students and the general public access to language data and statistical analysis tools:

NLP tools

We develop a number of NLP tools that help to build corpora automatically, or feed into manual correction loops:

  • xrenner - multilingual non-named entity and coreference resolution
  • RFTokenizer - a trainable segmenter for morphologically rich languages
  • Coptic NLP - a complete pipeline for processing Coptic data
  • HebPipe - an NLP pipeline for Hebrew

Annotation tools

We provide a number of freely available annotation tools:

  • rstWeb - open-source web interface for Rhetorical Structure Theory annotation
  • GitDox - a version-controlled online XML and spreadsheet editor with built-in validation
  • DepEdit - configurable rule-based editing for dependency corpora in the CoNLL-U format

Featured research

RST referential heatmap

What you say where - a discourse heatmap

Does discourse structure constrain where we talk about what? Research on recurring mentions within discourse graphs shows that back-reference is sensitive to the reasons why sentences and groups of sentences are uttered.

RNN reads newspaper for discourse signals

A neural network reads the newspaper...

... in search of discourse signals! We now know a lot about what cues people use to identify discourse relations, but can we teach computers to notice the same signals?