To use the default model on English language data, you can run the main script xrenner.py on a text file in the 10 column conll10 or conllu format like this:
> python xrenner.py infile.conll10
If you have other language models, you can use them like this, for example for German:
> python xrenner.py -m deu infile.conll10
The default output format is a simple inline SGML format, but other output formats are available, using the -o option:
> python xrenner.py -o conll infile.conll10
If your model includes alternative setting profiles in an override.ini file (or compressed model component), you can invoke these using the -x option, e.g. to change the default model from the OntoNotes scheme to the GUM scheme:
> python xrenner.py -x GUM infile.conll10
You can also turn on verbose mode using -v, which reports performance speed, and use -t without an input file to run unit tests and confirm the system is working as expected. Other options include -r to turn off classifiers (faster, but less accurate for models using classifiers), and -d <FILENAME> to dump all coreference candidate pairs to a file as training data for a classifier (see Building classifiers).
You can use glob syntax to process multiple files in batch mode. This will be substantially faster than invoking xrenner multiple times, since the lexical data will only be loaded once. For example, you can read all .conll10 files in a directory like this:
> python xrenner.py *.conll10
In batch mode, output file names are automatically generated using the input file name, minus extensions like ‘conll10’ or ‘conllu’, and suffixed with the output format extension. For PAULA, document directories with names corresponding to the input documents are generated automatically.
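The naming rule described above can be illustrated with a small sketch. This is not xrenner's own code: the helper `batch_output_name` is hypothetical, it only knows the two input extensions named in the text, and it assumes the output extension is simply the format name passed to -o.

```python
import os

# Hypothetical helper, not part of xrenner: it only mirrors the naming rule in the text.
# Assumption: only the two extensions mentioned in the documentation are stripped.
INPUT_EXTENSIONS = (".conll10", ".conllu")


def batch_output_name(input_path, out_format):
    """Strip a known input extension and append the output format as the new extension."""
    base = os.path.basename(input_path)
    for ext in INPUT_EXTENSIONS:
        if base.endswith(ext):
            base = base[: -len(ext)]
            break
    # Assumption: the output extension equals the format name given to -o
    return base + "." + out_format


print(batch_output_name("corpus/interview.conll10", "conll"))
```

The sketch returns a bare file name only; where xrenner actually writes the files is not covered here.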
If you have multiple cores available, batch mode can run much faster by using multiple processes via the -p option, for example:
> python xrenner.py -x GUM -p 4 *.conll10
Importing as a module
You can import xrenner as a module. It may be convenient to just install xrenner via pip in this scenario:
> pip install xrenner
Then you can import the Xrenner object and feed it a string containing a 10 column conll format parse (see below on formats):
from xrenner import Xrenner

xrenner = Xrenner()
# Get a parse in basic Stanford Dependencies (not UD)
my_conllx_result = some_parser.parse("John visited Spain. His visit went well.")

sgml_result = xrenner.analyze(my_conllx_result, "sgml")
print(sgml_result)
Keep in mind that the parser output must match whatever annotation scheme the xrenner model is expecting (tags, label names, head-dependent conventions, etc.)
xrenner uses the 10 column tab delimited conll format, with one line per token and a blank line between sentences. All of the following columns should be included (use an underscore for missing values):
- ID - token ID within sentence
- text - token text
- lemma - dictionary entry for this token (optional)
- pos - part of speech
- cpos - coarse or alternate part of speech (optional)
- morph - morphological information for this token (optional)
- head - ID of head token
- func - dependency function
- columns 9 and 10 - reserved for alternate trees with multiple parentage (DAGs)
1	Wikinews	Wikinews	NP	NNP	_	2	nsubj	_	_
2	interviews	interview	VVZ	VBZ	_	0	root	_	_
3	President	president	NN	NN	_	2	dobj	_	_
4	of	of	IN	IN	_	3	prep	_	_
5	the	the	DT	DT	_	7	det	_	_
6	International	international	NP	NNP	_	7	amod	_	_
7	Brotherhood	brotherhood	NP	NNP	_	4	pobj	_	_
8	of	of	IN	IN	_	7	prep	_	_
9	Magicians	magician	NPS	NNPS	_	8	pobj	_	_

1	Wednesday	Wednesday	NP	NNP	_	0	root	_	_
2	,	,	,	,	_	0	punct	_	_
3	October	October	NP	NNP	_	4	nn	_	_
4	9	9	CD	CD	_	1	appos	_	_
5	,	,	,	,	_	0	punct	_	_
6	2013	2013	CD	CD	_	3	tmod	_	_
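To check that a file follows the column layout listed above before passing it to xrenner, a reader along the following lines may help. It is a minimal sketch, not xrenner's own reader: the function `read_conll10` and the key names `col9` and `col10` are made up for illustration, and the sample reuses three token lines from the example.

```python
# Illustrative key names for the 10 columns; "col9" and "col10" are placeholders
FIELDS = ["id", "text", "lemma", "pos", "cpos", "morph", "head", "func", "col9", "col10"]


def read_conll10(text):
    """Split 10-column conll text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # comment lines are skipped in this sketch
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 tab-separated columns, got %d: %r" % (len(cols), line))
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


sample = "\n".join([
    "1\tWikinews\tWikinews\tNP\tNNP\t_\t2\tnsubj\t_\t_",
    "2\tinterviews\tinterview\tVVZ\tVBZ\t_\t0\troot\t_\t_",
    "",
    "1\tWednesday\tWednesday\tNP\tNNP\t_\t0\troot\t_\t_",
])
sentences = read_conll10(sample)
print(len(sentences), sentences[0][0]["text"], sentences[0][1]["func"])
```

Raising an error on a wrong column count is a design choice of the sketch; it makes missing underscores for empty values easy to spot.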
It is also possible to include comments on lines beginning with the pound sign. These are generally ignored, with the exception of optional sentence type (s_type) and speaker annotations, as in the example below, which can be used as part of coref_rules.tab (e.g. specifying having the same or different speakers as a condition):
# speaker="Mario J. Lucero"
# s_type="decl"
1	Heaven	_	NNP	NNP	_	2	nn	_	_
2	Sent	_	NNP	NNP	_	3	nn	_	_
3	Gaming	_	NNP	NNP	_	6	nsubj	_	_
4	is	_	VBZ	VBZ	_	6	cop	_	_
5	basically	_	RB	RB	_	6	advmod	_	_
6	me	_	PRP	PRP	_	0	root	_	_
7	and	_	CC	CC	_	6	cc	_	_
8	Isabel	_	NNP	NNP	_	6	conj	_	_
9	,	_	,	,	_	0	punct	_	_
10	I	_	PRP	PRP	_	14	nsubj	_	_
11	'm	_	VBP	VBP	_	14	cop	_	_
12	Mario	_	NNP	NNP	_	14	nn	_	_
13	J.	_	NNP	NNP	_	14	nn	_	_
14	Lucero	_	NNP	NNP	_	6	parataxis	_	_
15	.	_	.	.	_	0	punct	_	_

# speaker="Isabel Ruiz"
# s_type="decl"
1	And	_	CC	CC	_	6	cc	_	_
2	,	_	,	,	_	0	punct	_	_
3	I	_	PRP	PRP	_	6	nsubj	_	_
4	'm	_	VBP	VBP	_	6	cop	_	_
5	Isabel	_	NNP	NNP	_	6	nn	_	_
6	Ruiz	_	NNP	NNP	_	0	root	_	_
7	.	_	.	.	_	0	punct	_	_
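The speaker and s_type comments can be collected per sentence with a short script, for example to verify them before running xrenner. This is only a sketch under assumptions: it expects each annotation on its own comment line in the form `# key="value"`, and the function name `sentence_annotations` is hypothetical rather than part of xrenner.

```python
import re

# Assumption: one annotation per comment line, written as: # key="value"
ANNOTATION = re.compile(r'^#\s*(\w+)\s*=\s*"(.*)"\s*$')


def sentence_annotations(text):
    """Return one dict of comment annotations for each sentence that has tokens."""
    result, current, has_tokens = [], {}, False
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            # A blank line ends the sentence; keep its annotations if it had tokens
            if has_tokens:
                result.append(current)
            current, has_tokens = {}, False
        elif stripped.startswith("#"):
            match = ANNOTATION.match(stripped)
            if match:
                current[match.group(1)] = match.group(2)
        else:
            has_tokens = True
    if has_tokens:
        result.append(current)
    return result


sample = "\n".join([
    '# speaker="Mario J. Lucero"',
    '# s_type="decl"',
    "1\tHeaven\t_\tNNP\tNNP\t_\t2\tnn\t_\t_",
    "",
    '# speaker="Isabel Ruiz"',
    '# s_type="decl"',
    "1\tAnd\t_\tCC\tCC\t_\t6\tcc\t_\t_",
])
for annotations in sentence_annotations(sample):
    print(annotations)
```

Comment lines that do not match the `key="value"` pattern are ignored, in line with the statement that other comments are generally skipped.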
Using the -o flag, the following output formats are supported:
Inline SGML is the default output format. Each line is either a token or an opening or closing entity tag.
<referent id="referent_197" entity="person" group="34" antecedent="referent_142" type="coref">
Mrs.
Hills
</referent>
said
that
<referent id="referent_198" entity="place" group="2" antecedent="referent_157" type="coref">
the
U.S.
</referent>
is
still
concerned
about
``
disturbing
developments
in
<referent id="referent_201" entity="place" group="20" antecedent="referent_193" type="coref">
Turkey
</referent>
and
continuing
slow
progress
in
<referent id="referent_203" entity="place" group="20" antecedent="referent_201" type="coref">
Malaysia
</referent>
.
''
<referent id="referent_204" entity="person" group="34" antecedent="referent_197" type="ana">
She
</referent>
did
n't
elaborate
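Because each line of this format holds either one token or one tag, the output can be read back with a simple stack. The following is a sketch for post-processing, not part of xrenner: the function `read_sgml_mentions` is hypothetical, and it assumes one tag or token per line with attribute values in double quotes, as in the sample.

```python
import re

OPEN_TAG = re.compile(r"^<referent\s+(.*)>$")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def read_sgml_mentions(sgml):
    """Collect referent attributes and covered tokens, in order of closing tags."""
    mentions, stack = [], []
    for line in sgml.splitlines():
        line = line.strip()
        if not line:
            continue
        match = OPEN_TAG.match(line)
        if match:
            mention = dict(ATTRIBUTE.findall(match.group(1)))
            mention["tokens"] = []
            stack.append(mention)
        elif line == "</referent>":
            if stack:
                mentions.append(stack.pop())
        else:
            # A token belongs to every referent that is currently open (nesting)
            for open_mention in stack:
                open_mention["tokens"].append(line)
    return mentions


sample = "\n".join([
    '<referent id="referent_197" entity="person" group="34" antecedent="referent_142" type="coref">',
    "Mrs.",
    "Hills",
    "</referent>",
    "said",
    '<referent id="referent_204" entity="person" group="34" antecedent="referent_197" type="ana">',
    "She",
    "</referent>",
])
for mention in read_sgml_mentions(sample):
    print(mention["id"], mention["group"], " ".join(mention["tokens"]))
```

Keeping the `antecedent` attribute on each mention makes it possible to follow chains link by link, which grouping formats do not allow.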
Standard conll coreference format, one token per line and numbered opening/closing brackets in a separate column to express groups. Note that this format only groups mentions but does not represent antecedents and chain types (anaphora, apposition etc.) directly.
1	Portrait	_
2	shot	_
3	of	_
4	Dennis	(4
5	Hopper	4)
6	,	_
7	famous	_
8	for	_
9	his	(4)
10	role	_
11	in	_
12	the	_
13	1969	_
14	film	_
15	Easy	_
16	Rider	_
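The bracket column can be turned back into mention spans per group with a few lines of Python. This is a sketch, not xrenner code: the function `read_coref_column` is hypothetical, it takes the last whitespace-separated column as the bracket column, and it assumes that several brackets on one token would be separated by `|`, as is common in CoNLL coreference files (the sample above does not show this case).

```python
import re
from collections import defaultdict

# Matches pieces such as "(4", "4)" or "(4)"
BRACKET = re.compile(r"\(?\d+\)?")


def read_coref_column(lines):
    """Map each group ID to a list of (first_token_id, last_token_id) spans."""
    open_starts = defaultdict(list)  # group ID -> stack of start token IDs
    groups = defaultdict(list)
    for line in lines:
        cols = line.split()
        if len(cols) < 3 or cols[-1] == "_":
            continue
        token_id, coref = int(cols[0]), cols[-1]
        for piece in BRACKET.findall(coref):
            group = piece.strip("()")
            if piece.startswith("("):
                open_starts[group].append(token_id)
            if piece.endswith(")") and open_starts[group]:
                groups[group].append((open_starts[group].pop(), token_id))
    return dict(groups)


sample = [
    "3\tof\t_",
    "4\tDennis\t(4",
    "5\tHopper\t4)",
    "6\t,\t_",
    "9\this\t(4)",
]
print(read_coref_column(sample))
```

The spans use the token IDs from the first column; antecedent links and chain types cannot be recovered, since the format only records group membership.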
OntoNotes .coref XML format. Coreference types are represented, but only entity groups are used (no exact coref chains).
Portrait shot of <COREF ID="4" ENTITY="person" INFSTAT="new">Dennis Hopper</COREF> , famous for <COREF ID="4" ENTITY="person" INFSTAT="giv" TYPE="ana">his</COREF> role in the 1969 film Easy Rider
PAULA XML is a highly expressive, graph-like stand off XML format. Because PAULA documents are directory structures with multiple files, there is no need to specify an output file (> outfile) when using PAULA output.
The output using PAULA preserves both the exact antecedent chain structure and coreference types, as well as the optional information status annotations designating first mention (new) and subsequent mentions (giv for ‘given’).
This is an internal format used only to generate test cases for xrenner unit tests.
This is a dummy format setting - run the analysis but produce no output.