Building New Language Models

Introduction

New: a starter model with minimal settings for Universal Dependencies (UD) data is now included in models/udx.

Beyond the language models shipped with the xrenner distribution, you can build your own models or modify existing ones by editing their configuration files, gathering lexical data for your target language or domain, and training stochastic classifiers.

A language model in xrenner is defined by a set of files in a directory under the models/ directory, or by a compressed archive containing such files, conventionally marked by the extension .xrm. For example, the default English language model is models/eng/. Other models, typically named using ISO 639-2 three-letter language codes, can be invoked using the -m option. The following example invokes the German language model, named deu (for Deutsch, i.e. German):

> python xrenner.py -m deu infile.conll10

To locate this model, xrenner first checks whether you have supplied an absolute or relative model path (e.g. /path/to/model/deu.xrm, or my_subdir/deu/). If you have supplied just the model name, as in deu above, xrenner assumes that the model is located under the xrenner installation directory in the ./models/ sub-directory. If there is either a directory or a file called deu (or deu.xrm for a compressed model), that model will be used.
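
The lookup order described above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not xrenner's actual code: the function name resolve_model and the install_dir parameter are hypothetical, and the sketch assumes that any argument which exists on disk counts as a supplied path.

```python
import os

def resolve_model(model, install_dir):
    """Return the path of a model, or None if it cannot be found.

    Hypothetical helper mirroring the lookup order described in the text;
    it is not part of xrenner.
    """
    # 1. Assumption: an argument that exists on disk is treated as an
    #    explicit absolute or relative path (e.g. my_subdir/deu/).
    if os.path.exists(model):
        return model
    # 2. Otherwise treat it as a bare model name and look under
    #    <install_dir>/models/ for a directory or file of that name,
    #    then for a compressed <name>.xrm archive.
    models_dir = os.path.join(install_dir, "models")
    for candidate in (model, model + ".xrm"):
        path = os.path.join(models_dir, candidate)
        if os.path.exists(path):
            return path
    return None
```

Called as resolve_model("deu", "/path/to/xrenner"), the sketch would return the models/deu directory if it exists, fall back to models/deu.xrm, and return None if neither is present.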

Most of the language model files listed below are optional, but some must be included in every model, whether compressed or not. The main configuration file determining the general behavior for a language is config.ini. Variant behaviors for the same language model, using the same lexical data, can be created as configuration override profiles using the optional override.ini file.

At a minimum, a model must include config.ini and coref_rules.tab.

It is also highly recommended to include entities.tab, entity_heads.tab and pronouns.tab.

Language model files

Configuration files

File                          Status
config.ini                    mandatory
depedit.ini                   optional
override.ini                  optional

Lexical data files

File                          Status
affix_tokens.tab              optional
antonyms.tab                  optional
atoms.tab                     optional
coref.tab                     optional
coref_rules.tab               mandatory
entities.tab                  optional (but highly recommended)
entity_deps.tab               optional
entity_heads.tab              optional (but highly recommended)
entity_mods.tab               optional
hasa.tab                      optional
isa.tab                       optional
names.tab                     optional
new_modifiers.tab             optional
numbers.tab                   optional
open_close_punctuation.tab    optional
pronouns.tab                  optional (but highly recommended)
similar.tab                   optional
stop_list.tab                 optional
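
When assembling a new model directory, the mandatory and highly recommended markings above can be turned into a quick sanity check. The following is a minimal sketch under stated assumptions: check_model_dir is a hypothetical helper, not part of xrenner, and it only inspects uncompressed model directories, since the internal format of .xrm archives is not described here.

```python
import os

# File names taken from the status markings in this section.
MANDATORY = ("config.ini", "coref_rules.tab")
RECOMMENDED = ("entities.tab", "entity_heads.tab", "pronouns.tab")

def check_model_dir(model_dir):
    """Return (missing_mandatory, missing_recommended) for a model directory.

    Hypothetical helper for illustration; xrenner does not ship this function.
    """
    present = set(os.listdir(model_dir))
    missing_mandatory = [f for f in MANDATORY if f not in present]
    missing_recommended = [f for f in RECOMMENDED if f not in present]
    return missing_mandatory, missing_recommended
```

A model is usable in principle when the first returned list is empty; entries in the second list point to lexical data worth adding next.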