Search interfaces

The lab maintains two corpus search interfaces and an online dictionary, which give students and the general public access to language data and statistical analysis tools.

NLP tools

We develop a number of NLP tools that help build corpora automatically or feed into manual correction loops:

  • xrenner - multilingual non-named entity recognition and coreference resolution
  • RFTokenizer - a trainable segmenter for morphologically rich languages
  • Coptic NLP - a complete pipeline for processing Coptic data
  • HebPipe - an NLP pipeline for Hebrew

Annotation tools

We provide a number of freely available annotation tools:

  • rstWeb - open-source web interface for Rhetorical Structure Theory annotation
  • GitDox - a version-controlled online XML and spreadsheet editor with built-in validation
  • DepEdit - configurable rule-based editing for dependency corpora in the CoNLL-U format

Featured research


What's new in GUM?

The first release of GUM series 5 in 2019 brings several new features to our multilayer corpus - this post outlines the most important additions.

RNN reads newspaper for discourse signals

A neural network reads the newspaper...

... in search of discourse signals! We now know a lot about what cues people use to identify discourse relations, but can we teach computers to notice the same signals?

More research