Search interfaces

The lab maintains two corpus search interfaces, which offer students and the general public access to language data and statistical analysis tools, as well as an online dictionary.

NLP tools

We develop a number of NLP tools that help to build corpora automatically, or feed into manual correction loops:

  • xrenner - multilingual non-named entity and coreference resolution
  • RFTokenizer - a trainable segmenter for morphologically rich languages
  • Coptic NLP - a complete pipeline for processing Coptic data
  • HebPipe - an NLP pipeline for Hebrew

Annotation tools

We provide a number of freely available annotation tools:

  • rstWeb - open-source web interface for Rhetorical Structure Theory annotation
  • GitDox - a version-controlled online XML and spreadsheet editor with built-in validation
  • DepEdit - configurable rule-based editing for dependency corpora in the CoNLL-U format


Several of our corpora are freely available, open-source projects.

Featured research


What's new in GUM?

The first release of GUM series 5 in 2019 brings several new features to our multilayer corpus - this post outlines the most important additions.

New features in our Coptic NLP pipeline

Coptic Scriptorium’s Natural Language Processing (NLP) tools now support two new features...

RNN reads newspaper for discourse signals

A neural network reads the newspaper...

... in search of discourse signals! We now know a lot about what cues people use to identify discourse relations, but can we teach computers to notice the same signals?

More research