To use the default model on English language data, run the main script xrenner.py on a text file in the 10-column conll10 or conllu format, like this:
> python xrenner.py infile.conll10
If you have other language models, you can use them like this, for example for German:
> python xrenner.py -m deu infile.conll10
The default output format is a simple inline SGML format, but other output formats are available via the -o option:
> python xrenner.py -o conll infile.conll10
If your model includes alternative setting profiles in an override.ini file (or compressed model component), you can invoke these using the -x option, e.g. to change the default model from the OntoNotes scheme to the GUM scheme:
> python xrenner.py -x GUM infile.conll10
You can also turn on verbose mode with -v, which reports processing speed, and run -t without an input file to execute the unit tests and confirm the system is working as expected. Other options include -r to turn off classifiers (faster, but less accurate for models that use classifiers), and -d <FILENAME> to dump all coreference candidate pairs to a file as training data for a classifier (see Building classifiers).
You can use glob syntax to process multiple files in batch mode. This is substantially faster than invoking xrenner multiple times, since the lexical data is only loaded once. For example, you can read all .conll10 files in a directory like this:
> python xrenner.py *.conll10
In batch mode, output file names are automatically generated using the input file name, minus extensions like ‘conll10’ or ‘conllu’, and suffixed with the output format extension. For PAULA, document directories with names corresponding to the input documents are generated automatically.
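The naming convention just described can be sketched in a few lines of Python. This is an illustrative approximation of the behavior, not xrenner's actual code; the function name output_name is hypothetical:

```python
import os

def output_name(infile, out_ext):
    # Strip a recognized parse extension, then append the
    # output format's extension (illustrative sketch only)
    base, ext = os.path.splitext(infile)
    if ext.lstrip(".") not in ("conll10", "conllu"):
        base = infile  # other extensions are kept as-is
    return base + "." + out_ext

print(output_name("news_01.conll10", "sgml"))  # news_01.sgml
```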
If you have multiple cores available, batch mode works much faster by using multiple processes with the option -p, for example:
> python xrenner.py -x GUM -p 4 *.conll10
Note that if your model uses neural components with large contextualized embeddings (e.g. BERT), multiple processes may not fit in memory when RAM is limited. If you encounter "could not allocate memory" errors, consider trying -p 1.
Importing as a module
You can also import xrenner as a module. In this scenario it may be most convenient to install xrenner via pip:
> pip install xrenner
Then you can import the Xrenner object and feed it a string containing a 10 column conll format parse (see below on formats):
from xrenner import Xrenner

xrenner = Xrenner()

# Get a parse in Universal Dependencies
my_conllu_result = some_parser.parse("John visited Spain. His visit went well.")

sgml_result = xrenner.analyze(my_conllu_result, "sgml")
print(sgml_result)
Keep in mind that the parser output must match whatever annotation scheme the xrenner model expects (tags, label names, head-dependent conventions, etc.).
xrenner uses the 10 column tab delimited conllu format, with one line per token and a blank line between sentences. All of the following columns should be included (use an underscore for missing values):
- ID - token ID within sentence
- text - token text
- lemma - dictionary entry for this token (optional)
- upos - universal part of speech
- xpos - native part of speech (for English: PTB)
- morph - morphological information for this token (optional)
- head - ID of head token
- func - dependency function
- columns 9-10 - reserved for alternate trees with multiple parentage (DAGs) and misc. annotations
1	Wikinews	_	PROPN	NNP	_	2	nsubj	_	_
2	interviews	_	VERB	VBZ	_	0	root	_	_
3	President	_	NOUN	NN	_	2	obj	_	_
4	of	_	ADP	IN	_	7	case	_	_
5	the	_	DET	DT	_	7	det	_	_
6	International	_	PROPN	NNP	_	7	amod	_	_
7	Brotherhood	_	PROPN	NNP	_	3	nmod	_	_
8	of	_	ADP	IN	_	9	case	_	_
9	Magicians	_	PROPN	NNPS	_	7	nmod	_	_

1	Wednesday	_	PROPN	NNP	_	0	root	_	_
2	,	_	PUNCT	,	_	4	punct	_	_
3	October	_	PROPN	NNP	_	4	compound	_	_
4	9	_	NUM	CD	_	1	appos	_	_
5	,	_	PUNCT	,	_	6	punct	_	_
6	2013	_	NUM	CD	_	4	nmod:tmod	_	_
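To make the column layout concrete, a token line can be split into the ten named fields with plain Python. This is a minimal sketch for illustration, not part of xrenner's API; the last two column names follow the CoNLL-U convention (deps, misc):

```python
# Column names follow the list above
COLS = ["id", "text", "lemma", "upos", "xpos", "morph", "head", "func", "deps", "misc"]

def parse_token_line(line):
    fields = line.rstrip("\n").split("\t")
    assert len(fields) == 10, "expected 10 tab-delimited columns"
    return dict(zip(COLS, fields))

tok = parse_token_line("1\tWikinews\t_\tPROPN\tNNP\t_\t2\tnsubj\t_\t_")
print(tok["text"], tok["upos"], tok["head"], tok["func"])  # Wikinews PROPN 2 nsubj
```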
It is also possible to include comments on lines beginning with the pound sign. These are generally ignored, with the exception of optional sentence type (s_type) and speaker annotations, as in the example below; these can be used in coref_rules.tab (e.g. to require that two mentions have the same or different speakers).
# speaker="Mario J. Lucero"
# s_type="decl"
1	Heaven	_	PROPN	NNP	_	6	nsubj	_	_
2	Sent	_	PROPN	NNP	_	1	flat	_	_
3	Gaming	_	PROPN	NNP	_	1	flat	_	_
4	is	_	AUX	VBZ	_	6	cop	_	_
5	basically	_	ADV	RB	_	6	advmod	_	_
6	me	_	PRON	PRP	_	0	root	_	_
7	and	_	CCONJ	CC	_	8	cc	_	_
8	Isabel	_	PROPN	NNP	_	6	conj	_	_
9	,	_	PUNCT	,	_	12	punct	_	_
10	I	_	PRON	PRP	_	12	nsubj	_	_
11	'm	_	AUX	VBP	_	12	cop	_	_
12	Mario	_	PROPN	NNP	_	6	parataxis	_	_
13	J.	_	PROPN	NNP	_	12	flat	_	_
14	Lucero	_	PROPN	NNP	_	12	flat	_	_
15	.	_	PUNCT	.	_	6	punct	_	_

# speaker="Isabel Ruiz"
# s_type="decl"
1	And	_	CCONJ	CC	_	5	cc	_	_
2	,	_	PUNCT	,	_	1	punct	_	_
3	I	_	PRON	PRP	_	5	nsubj	_	_
4	'm	_	AUX	VBP	_	5	cop	_	_
5	Isabel	_	PROPN	NNP	_	0	root	_	_
6	Ruiz	_	PROPN	NNP	_	5	flat	_	_
7	.	_	PUNCT	.	_	5	punct	_	_
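Comment metadata of this shape is easy to pick out with a regular expression. The following is an illustrative sketch of how such annotations could be read; read_metadata is a hypothetical helper, not xrenner's actual parser:

```python
import re

def read_metadata(comment_lines):
    # Collect speaker and s_type values from lines like: # speaker="..."
    meta = {}
    for line in comment_lines:
        m = re.match(r'#\s*(speaker|s_type)="([^"]*)"', line)
        if m:
            meta[m.group(1)] = m.group(2)
    return meta

meta = read_metadata(['# speaker="Mario J. Lucero"', '# s_type="decl"'])
print(meta)  # {'speaker': 'Mario J. Lucero', 's_type': 'decl'}
```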
Using the -o flag, the following output formats are supported:
This is the default output format. Each line is either a token or an opening or closing entity tag.
<referent id="referent_197" entity="person" group="34" antecedent="referent_142" type="coref">
Mrs.
Hills
</referent>
said
that
<referent id="referent_198" entity="place" group="2" antecedent="referent_157" type="coref">
the
U.S.
</referent>
is
still
concerned
about
``
disturbing
developments
in
<referent id="referent_201" entity="place" group="20" antecedent="referent_193" type="coref">
Turkey
</referent>
and
continuing
slow
progress
in
<referent id="referent_203" entity="place" group="20" antecedent="referent_201" type="coref">
Malaysia
</referent>
.
''
<referent id="referent_204" entity="person" group="34" antecedent="referent_197" type="ana">
She
</referent>
did
n't
elaborate
Standard conll coreference format: one token per line, with numbered opening/closing brackets in a separate column to express groups. Note that this format only groups mentions; it does not represent antecedents and chain types (anaphora, apposition, etc.) directly.
1	Portrait	_
2	shot	_
3	of	_
4	Dennis	(4
5	Hopper	4)
6	,	_
7	famous	_
8	for	_
9	his	(4)
10	role	_
11	in	_
12	the	_
13	1969	_
14	film	_
15	Easy	_
16	Rider	_
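The bracket column can be turned back into mention spans as follows. This is a minimal sketch assuming at most one open bracket per group at a time; bracket_spans is an illustrative helper, not part of xrenner:

```python
def bracket_spans(rows):
    # rows: (token_id, coref_field) pairs; "(4" opens group 4,
    # "4)" closes it, and "(4)" marks a single-token mention
    open_at, spans = {}, {}
    for tid, field in rows:
        if field == "_":
            continue
        for part in field.split("|"):
            group = part.strip("()")
            if part.startswith("(") and part.endswith(")"):
                spans.setdefault(group, []).append((tid, tid))
            elif part.startswith("("):
                open_at[group] = tid
            elif part.endswith(")"):
                spans.setdefault(group, []).append((open_at.pop(group), tid))
    return spans

spans = bracket_spans([(4, "(4"), (5, "4)"), (9, "(4)")])
print(spans)  # {'4': [(4, 5), (9, 9)]}
```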
OntoNotes .coref XML format. Coreference types are represented, but only entity groups are used (no exact coref chains).
Portrait shot of <COREF ID="4" ENTITY="person" INFSTAT="new">Dennis Hopper</COREF> , famous for <COREF ID="4" ENTITY="person" INFSTAT="giv" TYPE="ana">his</COREF> role in the 1969 film Easy Rider
PAULA XML is a highly expressive, graph-like stand-off XML format. Because PAULA documents are directory structures containing multiple files, there is no need to specify an output file (> outfile) when using PAULA output.
The PAULA output preserves both the exact antecedent chain structure and coreference types, as well as the optional information status annotations designating first mentions (new) and subsequent mentions (giv, for 'given').
Tab-delimited format used by the WebAnno and Inception annotation tools.
#FORMAT=WebAnno TSV 3.2
#T_SP=webanno.custom.Referent|entity|infstat
#T_RL=webanno.custom.Coref|coreftype|BT_webanno.custom.Referent

#Text=Charles J. Fillmore
1-1	0-7	Charles	person	new	coref	2-1[2_1]
1-2	8-10	J.	person	new	_	_
1-3	11-19	Fillmore	person	new	_	_

#Text=Charles J. Fillmore ( August 9 , 1929 – February 13 , 2014 ) was an American linguist and Professor of Linguistics at the University of California , Berkeley .
2-1	20-27	Charles	person	giv	coref	2-16[9_2]
2-2	28-30	J.	person	giv	_	_
2-3	31-39	Fillmore	person	giv	_	_
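A token line in this format can be unpacked like so. This sketch is based only on the example above (real WebAnno TSV has additional cases such as stacked annotations), and parse_tsv_token is an illustrative name:

```python
def parse_tsv_token(line):
    # Fields as in the example: sentence-token id, character offsets,
    # token text, entity label, infstat, coref type, coref link
    sent_tok, offsets, text, entity, infstat, coreftype, link = line.split("\t")
    start, end = (int(x) for x in offsets.split("-"))
    return {"id": sent_tok, "text": text, "entity": entity,
            "infstat": infstat, "span": (start, end)}

tsv_tok = parse_tsv_token("1-1\t0-7\tCharles\tperson\tnew\tcoref\t2-1[2_1]")
print(tsv_tok["text"], tsv_tok["entity"], tsv_tok["span"])  # Charles person (0, 7)
```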
This is an internal format used only to generate test cases for xrenner unit tests.
This is a dummy format setting: the analysis runs but no output is produced. Useful when using the -d/--dump option to generate classifier training data.