Developers’ module documentation

Classes

Xrenner

class modules.xrenner_xrenner.Xrenner(model='eng', override=None, rule_based=False, no_seq=False)[source]
analyze(infile, out_format)[source]

Method to run coreference analysis with loaded model

Parameters:
  • infile – file name of the parse file in the conll10 format, or the pre-read parse itself
  • out_format – format to determine output type, one of: html, paula, webanno, conll, onto, unittest
Returns:

output based on requested format

analyze_markable(mark, lex)[source]

Find entity, agreement and cardinality information for a markable

Parameters:
  • mark – The Markable object to analyze
  • lex – the LexData object with gazetteer information and model settings
Returns:

void

load(model='eng', override=None)[source]

Method to load model data. Normally invoked by constructor, but can be repeated to change models later.

Parameters:
  • model – model directory in models/ specifying settings and gazetteers for this language (default: eng)
  • override – name of a section in models/override.ini if configuration overrides should be applied
Returns:

None

process_sentence(tokoffset, sentence)[source]

Function to analyze a single sentence

Parameters:
  • tokoffset – the offset in tokens for the beginning of the current sentence within all input tokens
  • sentence – the Sentence object containing mood, speaker and other information about this sentence
Returns:

void

serialize_output(out_format, parse=None)[source]

Return a string representation of the output in some format, or generate PAULA directory structure as output

Parameters:
  • out_format – the format to generate, one of: html, paula, webanno, conll, onto, unittest
  • parse – the original parse input fed to xrenner; only needed for unittest output
Returns:

specified output format string, or void for paula

set_doc_name(name)[source]

Method to manually set the name of the document being processed, rather than deriving it from an input file name.

Parameters:name – string, the name to give the document
Returns:None

ParsedToken

class modules.xrenner_classes.ParsedToken(tok_id, text, lemma, pos, morph, head, func, sentence, modifiers, child_funcs, child_strings, lex, quoted=False, head2='_', func2='_')[source]

Markable

class modules.xrenner_classes.Markable(mark_id, head, form, definiteness, start, end, text, core_text, entity, entity_certainty, subclass, infstat, agree, sentence, antecedent, coref_type, group, alt_entities, alt_subclasses, alt_agree, cardinality=0, submarks=[], coordinate=False, agree_certainty='')[source]
extract_features(lex, antecedent=None, candidate_list=[], dump_position=False)[source]

Function to generate feature representation of markables or markable-antecedent pairs for classifiers

Parameters:
  • lex – the LexData object with gazetteer information and model settings
  • antecedent – The antecedent Markable potentially coreferring to self
  • candidate_list – The list of candidate markables under consideration, used to extract cohort size
  • dump_position – Whether document name + token positions are dumped for each markable to compare to gold
Returns:

dictionary of markable properties

CorefRule

class modules.xrenner_rule.CorefRule(rule_string, rule_num)[source]

ConstraintMatcher

class modules.xrenner_rule.ConstraintMatcher(constraint)[source]

LexData

class modules.xrenner_lex.LexData(model, xrenner, override=None, rule_based=False, no_seq=False)[source]

Class to hold lexical information from gazetteers and training data. Use model argument to define subdirectory under models/ for reading different sets of configuration files.

get_atoms()[source]

Function to compile atom list for atomic markable recognition. Currently treats listed persons, places, organizations and inanimate objects from lexical data as atomic by default.

Returns:dictionary of atoms.
get_filters(override=None)[source]

Reads model settings from config.ini and possibly overrides from override.ini

Parameters:override – optional section name in override.ini
Returns:filters - dictionary of settings from config.ini with possible overrides
static get_first_last_names(names)[source]

Collects separate first and last name data from the collection in names.tab

Parameters:names – The complete names dictionary from names.tab, mapping full name to agreement
Returns:[firsts, lasts] - list containing dictionary of first names to agreement and set of last names
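The splitting logic can be sketched as a minimal standalone reimplementation (the example names and agreement values are illustrative, not actual names.tab contents):

```python
def split_first_last(names):
    """Split a full-name-to-agreement mapping into first name data and a last name set."""
    firsts = {}    # first name -> agreement class
    lasts = set()  # last tokens of multi-token names
    for full_name, agree in names.items():
        parts = full_name.split(" ")
        if len(parts) > 1:
            firsts[parts[0]] = agree
            lasts.add(parts[-1])
    return [firsts, lasts]

firsts, lasts = split_first_last({"Abraham Lincoln": "male", "Marie Curie": "female"})
# firsts maps "Abraham" to "male"; lasts contains "Lincoln" and "Curie"
```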
get_func_substitutes()[source]

Function for semi-hard-wired function substitutions based on function label and dependency direction. Uses func_substitute_forward and func_substitute_backward settings in config.ini

Returns:list of compiled substitutions_forward, substitutions_backward
get_morph()[source]

Compiles a morphological affix dictionary based on members of entity_heads.tab

Returns:dictionary from affixes to dictionaries mapping classes to type frequencies
get_pos_agree_mappings()[source]

Gets dictionary mapping POS categories to default agreement classes, e.g. NNS > plural

Returns:mapping dictionary
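Conceptually the result is a simple lookup table (the class labels below are illustrative; the real mappings come from the model's configuration files):

```python
# Illustrative POS-to-agreement defaults in the spirit of NNS > plural
POS_AGREE = {"NNS": "plural", "NNPS": "plural", "NN": "sg", "NNP": "sg"}

def default_agree(pos):
    """Return the default agreement class for a POS tag, or '' if none is mapped."""
    return POS_AGREE.get(pos, "")
```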
lemmatize(token)[source]

Simple lemmatization function using rules from lemma_rules in config.ini

Parameters:token – ParsedToken object to be lemmatized
Returns:string - the lemma
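A lemmatizer of this kind can be sketched as ordered regex substitutions (the rule syntax of lemma_rules in config.ini may differ; these suffix rules are purely illustrative):

```python
import re

# Hypothetical suffix rules, applied in order; first match wins
LEMMA_RULES = [(re.compile(r"ies$"), "y"),
               (re.compile(r"s$"), "")]

def lemmatize(text):
    """Return the lemma produced by the first matching rule, else the text unchanged."""
    for pattern, replacement in LEMMA_RULES:
        if pattern.search(text):
            return pattern.sub(replacement, text)
    return text
```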
parse_coref_rules(rule_list)[source]

Reader function to parse coref_rules.tab into CorefRule objects in two lists: one for general rules and one also including rules to use when speaker info is available.

Parameters:rule_list – textual list of rules
Returns:two separate lists of compiled CorefRule objects with and without speaker specifications
process_morph(token)[source]

Simple mechanism for substituting values in the morph feature of input tokens. For more elaborate sub-graph dependent manipulations, use the depedit module

Parameters:token – ParsedToken object to edit morph feature
Returns:string - the edited morph feature
read_antonyms()[source]

Function to create a dictionary from each word to all of its antonyms in antonyms.tab

Returns:dictionary from words to antonym sets
read_delim(filename, mode='normal', atom_list_name='atoms', add_to_sums=False, sep=', ')[source]

Generic file reader for lexical data in model directory

Parameters:
  • filename – string - name of the file
  • mode – double, triple, quadruple, quadruple_numeric, triple_numeric or low reading mode
  • atom_list_name – list of atoms to use for triple reader mode
  • add_to_sums – whether to sum numbers from multiple instances of the same key
  • sep – separator for double_with_sep mode
Returns:

compiled lexical data, usually a structured dictionary or set depending on number of columns
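The reading modes correspond to how many columns are consumed per line; a minimal sketch for two of them (simplified from the description above):

```python
def read_delim_lines(lines, mode="normal"):
    """Minimal tab-delimited reader: 'normal' collects a set of first-column
    values, 'double' builds a dict from the first column to the second."""
    result = {} if mode == "double" else set()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split("\t")
        if mode == "double":
            result[fields[0]] = fields[1]
        else:
            result.add(fields[0])
    return result
```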

read_isa()[source]

Reads isa.tab into a dictionary from words to lists of isa-matches

Returns:dictionary from words to lists of corresponding isa-matches

Modules

depedit

DepEdit - A simple configurable tool for manipulating dependency trees

Input: CoNLL10 or CoNLLU (10 columns, tab-delimited, blank line between sentences, comments with pound sign #)
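For reference, here is a minimal two-token sentence in this format, together with a sketch of splitting such input into sentences of 10-column token rows:

```python
conll10 = "1\tMary\tMary\tNNP\tNNP\t_\t2\tnsubj\t_\t_\n" \
          "2\tleft\tleave\tVBD\tVBD\t_\t0\troot\t_\t_\n"

def read_conll(text):
    """Split CoNLL-style text into sentences; each token is a list of 10 columns."""
    sentences, current = [], []
    for line in text.split("\n"):
        if line.startswith("#"):
            continue          # comment line
        if not line.strip():  # blank line terminates a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences
```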

Author: Amir Zeldes

xrenner_compatible

modules.xrenner_compatible.acronym_match(mark, candidate, lex)[source]

Check whether a Markable’s text is an acronym of a candidate Markable’s text

Parameters:
  • mark – The Markable object to test
  • candidate – The candidate Markable with potentially acronym-matching text
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool
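The core check can be sketched as a first-initials comparison (a simplification; the actual implementation also consults model settings in lex):

```python
def is_acronym_of(short_text, full_text):
    """True if short_text matches the initials of the capitalized words in full_text."""
    initials = "".join(word[0] for word in full_text.split() if word[0].isupper())
    return short_text.isupper() and len(short_text) > 1 and short_text == initials

# e.g. "WHO" vs. "World Health Organization"
```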

modules.xrenner_compatible.agree_compatible(mark1, mark2, lex)[source]

Checks if the agree property of two markables is compatible for possible coreference

Parameters:
  • mark1 – the first of two markables to compare agreement
  • mark2 – the second of two markables to compare agreement
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_compatible.best_candidate(markable, candidate_set, lex, rule, take_first=False)[source]

Selects the best antecedent markable for a markable from a set of rule-based candidates

Parameters:
  • markable – markable to find the best antecedent for
  • candidate_set – set of markables which are possible antecedents based on some coref_rule
  • lex – the LexData object with gazetteer information and model settings
  • rule – the CorefRule producing the match, which carries the feature propagation instructions, the rule number in coref_rules.tab, and the name of the pickled classifier to use (“_default_” for heuristic matching)
  • take_first – boolean, whether to skip matching and use the most recent candidate (minimum token distance). This saves time if a rule is guaranteed to produce a unique, correct candidate (e.g. reflexives)
Returns:

Markable object or None (the selected best antecedent markable, if available)

modules.xrenner_compatible.entities_compatible(mark1, mark2, lex)[source]

Checks if the entity property of two markables is compatible for possible coreference

Parameters:
  • mark1 – the first of two markables to compare entities
  • mark2 – the second of two markables to compare entities
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_compatible.group_agree_compatible(markable, candidate, previous_markables, lex)[source]
Parameters:
  • markable – markable whose group the candidate might be joined to
  • candidate – candidate to check for compatibility with all group members
  • previous_markables – all previous markables which may need to inherit from the model/host
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_compatible.isa(markable, candidate, lex)[source]

Staging function to check for and store new cached isa information. Calls actual run_isa() function if pair is still viable for new isa match.

Parameters:
  • markable – one of two markables to compare lexical isa relationship with
  • candidate – the second markable, which is a candidate antecedent for the other markable
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_compatible.merge_entities(mark1, mark2, previous_markables, lex)[source]

Negotiates entity mismatches across coreferent markables and their groups. Returns True if merging has occurred.

Parameters:
  • mark1 – the first of two markables to merge entities for
  • mark2 – the second of two markables to merge entities for
  • previous_markables – all previous markables which may need to inherit from the model/host
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_compatible.modifiers_compatible(markable, candidate, lex, allow_force_proper_mod_match=True)[source]

Checks whether the dependents of two markables are compatible for possible coreference

Parameters:
  • markable – one of two markables to compare dependents for
  • candidate – the second markable, which is a candidate antecedent for the other markable
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_compatible.run_isa(markable, candidate, lex)[source]

Checks whether two markables are compatible for coreference via the isa-relation

Parameters:
  • markable – one of two markables to compare lexical isa relationship with
  • candidate – the second markable, which is a candidate antecedent for the other markable
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_compatible.score_match_heuristic(markable, candidate, features, lex)[source]

Basic fall-back function for heuristic match scoring when no classifier is available

Parameters:
  • markable – the Markable to score against the candidate
  • candidate – the candidate antecedent Markable
  • features – dictionary of features extracted for the markable-candidate pair
  • lex – the LexData object with gazetteer information and model settings
Returns:

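Such a fall-back can be sketched as a weighted sum over fired boolean features (the feature names and weights below are invented for illustration):

```python
def score_match(features, weights=None):
    """Score a markable-antecedent pair by summing the weights of features that fired."""
    if weights is None:
        weights = {"same_head": 2.0, "agree_match": 1.0, "entity_match": 1.0}
    return sum(weight for name, weight in weights.items() if features.get(name))
```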
modules.xrenner_compatible.update_group(host, model, previous_markables, lex)[source]

Attempts to update the entire coreference group of a host markable with information gathered from a model markable discovered to be possibly coreferent with the host. If incompatible modifiers are discovered, the process fails and update_group returns False; otherwise updating succeeds and it returns True.

Parameters:
  • host – the first markable discovered to be coreferent with the model
  • model – the model markable, containing new information for the group
  • previous_markables – all previous markables which may need to inherit from the model/host
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

xrenner_coref

modules.xrenner_coref.antecedent_prohibited(markable, conll_tokens, lex)[source]

Check whether a Markable object is prohibited from having an antecedent

Parameters:
  • markable – The Markable object to check
  • conll_tokens – The list of ParsedToken objects up to and including the current sentence
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_coref.coref_rule_applies(lex, constraints, mark, anaphor=None)[source]

Check whether a markable definition from a coref rule applies to this markable

Parameters:
  • lex – the LexData object with gazetteer information and model settings
  • constraints – the constraints defining the relevant Markable
  • mark – the Markable object to check constraints against
  • anaphor – if this is an antecedent check, the anaphor is passed for $1-style constraint checks
Returns:

bool: True if ‘mark’ fits all constraints, False if any of them fail

modules.xrenner_coref.find_antecedent(markable, previous_markables, lex, restrict_rule='')[source]

Search for antecedents by cycling through coref rules for previous markables

Parameters:
  • markable – Markable object to find an antecedent for
  • previous_markables – Markables in all sentences up to and including current sentence
  • lex – the LexData object with gazetteer information and model settings
  • restrict_rule – a string specifying a subset of rules that should be checked (e.g. only rules with ‘appos’)
Returns:

candidate, matching_rule - the best antecedent and the rule that matched it

modules.xrenner_coref.search_prev_markables(markable, previous_markables, rule, lex)[source]

Search for antecedent to specified markable using a specified rule

Parameters:
  • markable – The markable object to find an antecedent for
  • previous_markables – The list of known markables up to and including the current sentence; markables beyond the current markable but in its sentence are included for cataphora
  • rule – the CorefRule being checked, which supplies the antecedent constraints (a list of ConstraintMatcher objects), the antecedent specification string, the maximum distance in sentences for the search (0 for a within-sentence search), and whether and in which direction to propagate features upon a match
  • lex – the LexData object with gazetteer information and model settings
Returns:

the selected candidate Markable object

xrenner_marker

modules.xrenner_marker.assign_coordinate_entity(mark, markables_by_head)[source]

Checks if all constituents of a coordinate markable have the same entity and subclass and if so, propagates these to the coordinate markable.

Parameters:
  • mark – a coordinate markable to check the entities of its constituents
  • markables_by_head – dictionary of markables by head id
Returns:

void

modules.xrenner_marker.construct_modifier_substring(modifier)[source]

Creates a list of tokens representing a modifier and all of its submodifiers in sequence

Parameters:modifier – A ParsedToken object from the modifier list of the head of some markable
Returns:Text of that modifier together with its modifiers in sequence
modules.xrenner_marker.disambiguate_entity(mark, lex)[source]

Selects the preferred entity for a Markable with multiple alt_entities based on dependency information or the more common type

Parameters:
  • mark – the Markable object
  • lex – the LexData object with gazetteer information and model settings
Returns:

predicted entity type as string

modules.xrenner_marker.get_mod_ordered_dict(mod)[source]

Retrieves the (sub)modifiers of a modifier token

Parameters:mod – A ParsedToken object representing a modifier of the head of some markable
Returns:Recursive ordered dictionary of that modifier’s own modifiers
modules.xrenner_marker.is_atomic(mark, atoms, lex)[source]

Checks if nested markables are allowed within this markable

Parameters:
  • mark – the Markable to be checked for atomicity
  • atoms – list of atomic markable text strings
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_marker.lookup_has_entity(text, lemma, entity, lex)[source]

Checks if a certain token text or lemma have the specific entity listed in the entities or entity_heads lists

Parameters:
  • text – text of the token
  • lemma – lemma of the token
  • entity – entity to check for
  • lex – the LexData object with gazetteer information and model settings
Returns:

bool

modules.xrenner_marker.markables_overlap(mark1, mark2, lex=None)[source]

Helper function to check if two markables cover some of the same tokens. Note that if the lex argument is specified, it is used to recognize possessives, which behave exceptionally. Possessive pronouns beginning after a main markable has started are tolerated in case of markable definitions including relative clauses, e.g. [Mr. Pickwick, who was looking for [his] hat]

Parameters:
  • mark1 – First Markable
  • mark2 – Second Markable
  • lex – the LexData object with gazetteer information and model settings or None
Returns:

bool
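Ignoring the possessive exception, the basic token-span test reduces to interval overlap:

```python
def spans_overlap(start1, end1, start2, end2):
    """True if the inclusive token spans [start1, end1] and [start2, end2] share a token."""
    return start1 <= end2 and start2 <= end1
```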

modules.xrenner_marker.parse_entity(entity_text, certainty='uncertain')[source]

Parses an entity -tab- subclass(/agree) string plus a certainty value into a tuple

Parameters:
  • entity_text – the string to parse, must contain exactly two tabs
  • certainty – the certainty string at end of tuple, default ‘uncertain’
Returns:

quadruple of (entity, subclass, agree, certainty)
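A sketch of the parse, assuming a tab-separated entity and subclass with an optional slash-delimited agreement value (the exact column layout is an assumption):

```python
def parse_entity_sketch(entity_text, certainty="uncertain"):
    """Parse 'entity<TAB>subclass(/agree)' into (entity, subclass, agree, certainty)."""
    entity, subclass = entity_text.split("\t")[:2]
    agree = ""
    if "/" in subclass:
        subclass, agree = subclass.split("/", 1)
    return (entity, subclass, agree, certainty)
```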

modules.xrenner_marker.pos_func_combo(pos, func, pos_func_heads_string)[source]

Checks whether a POS tag and dependency function combination is licensed by the pos_func_heads configuration string

Returns:bool
modules.xrenner_marker.recognize_entity_by_mod(mark, lex, mark_atoms=False)[source]

Attempt to recognize entity type based on modifiers

Parameters:
  • mark – the Markable for which to identify the entity type
  • lex – the LexData object whose modifier lexicon and model settings are consulted
Returns:

String (entity type, possibly including subtype and agreement)

modules.xrenner_marker.remove_infix_tokens(marktext, lex)[source]

Remove infix tokens such as dashes, interfixed articles (in Semitic construct state) etc.

Parameters:
  • marktext – the markable text string to remove tokens from
  • lex – the LexData object with gazetteer information and model settings
Returns:

potentially truncated text

modules.xrenner_marker.remove_prefix_tokens(marktext, lex)[source]

Remove leading tokens such as articles and other tokens configured as potentially redundant to citation form

Parameters:
  • marktext – the markable text string to remove tokens from
  • lex – the LexData object with gazetteer information and model settings
Returns:

potentially truncated text
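For an English-style model this amounts to stripping configured leading tokens (the article list below stands in for the model configuration):

```python
def strip_prefix_tokens(marktext, prefix_tokens=("the", "a", "an")):
    """Remove leading tokens listed as redundant from a space-tokenized markable string."""
    tokens = marktext.split(" ")
    while tokens and tokens[0].lower() in prefix_tokens:
        tokens = tokens[1:]
    return " ".join(tokens)
```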

modules.xrenner_marker.remove_suffix_tokens(marktext, lex)[source]

Remove trailing tokens such as genitive ‘s and other tokens configured as potentially redundant to citation form

Parameters:
  • marktext – the markable text string to remove tokens from
  • lex – the LexData object with gazetteer information and model settings
Returns:

potentially truncated text

modules.xrenner_marker.resolve_cardinality(mark, lex)[source]

Find cardinality for Markable based on numerical modifiers or number words

Parameters:
  • mark – The Markable to resolve agreement for
  • lex – the LexData object with gazetteer information and model settings
Returns:

Cardinality as float, zero if unknown
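The resolution can be sketched as a lookup over number words with a numeric fallback (the word list is illustrative):

```python
NUMBER_WORDS = {"one": 1.0, "two": 2.0, "both": 2.0, "three": 3.0}

def resolve_cardinality_sketch(modifier_strings):
    """Return the cardinality of the first numeric modifier, or 0.0 if none is found."""
    for mod in modifier_strings:
        low = mod.lower()
        if low in NUMBER_WORDS:
            return NUMBER_WORDS[low]
        try:
            return float(low.replace(",", ""))
        except ValueError:
            continue
    return 0.0
```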

modules.xrenner_marker.resolve_entity_cascade(entity_text, mark, lex)[source]

Retrieve possible entity types for a given text fragment based on entities list, entity heads and names list.

Parameters:
  • entity_text – The text to determine the entity for
  • mark – The Markable hosting the text fragment to retrieve context information from (e.g. dependency)
  • lex – the LexData object with gazetteer information and model settings
Returns:

entity type; note that this is used to decide whether to stop the search, but the Markable’s entity is already set during processing together with matching subclass and agree information

modules.xrenner_marker.resolve_mark_agree(mark, lex)[source]

Resolve Markable agreement based on morph information in tokens or gazetteer data

Parameters:
  • mark – The Markable to resolve agreement for
  • lex – the LexData object with gazetteer information and model settings
Returns:

void

modules.xrenner_marker.resolve_mark_entity(mark, lex)[source]

Main function to set entity type based on progressively less restricted parts of a markable’s text

Parameters:
  • mark – The Markable object to get the entity type for
  • lex – the LexData object with gazetteer information and model settings
Returns:

void

xrenner_postprocess

Postprocessing module. Alters results of coreference analysis based on model settings, such as deleting certain markables or re-wiring coreference relations according to a particular annotation scheme

Author: Amir Zeldes and Shuo Zhang

modules.xrenner_postprocess.kill_zero_marks(markables, markstart_dict, markend_dict)[source]

Removes markables whose id has been set to 0 in postprocessing

Parameters:
  • markables – All Markable objects
  • markstart_dict – Dictionary of token span start ids to lists of markables starting at that id
  • markend_dict – Dictionary of token span end ids to lists of markables ending at that id
Returns:

void

xrenner_preprocess

modules/xrenner_preprocess.py

Prepare parser output for entity and coreference resolution

Author: Amir Zeldes

modules.xrenner_preprocess.add_child_info(conll_tokens, child_funcs, child_strings, lex)[source]

Adds a list of all dependent functions and token strings to each parent token

Parameters:
  • conll_tokens – The ParsedToken list so far
  • child_funcs – Dictionary from ids to child functions
  • child_strings – Dictionary from ids to child strings
Returns:

void

modules.xrenner_preprocess.add_negated_parents(conll_tokens, offset)[source]

Sets the neg_parent property on tokens whose head dominates a negation

Parameters:
  • conll_tokens – token list for this document
  • offset – token ID reached in last sentence
Returns:

None

modules.xrenner_preprocess.replace_conj_func(conll_tokens, tokoffset, lex)[source]

Function to replace functions of tokens matching the conjunction function with their parent’s function

Parameters:
  • conll_tokens – The ParsedToken list so far
  • tokoffset – The starting token for this sentence
  • lex – the LexData object with gazetteer information and model settings
Returns:

void

xrenner_propagate

modules/xrenner_propagate.py

Feature propagation module. Propagates entity and agreement features for coreferring markables.

Author: Amir Zeldes

modules.xrenner_propagate.propagate_agree(markable, candidate)[source]

Propagate agreement between two markables if one has unknown agreement

Parameters:
  • markable – Markable object
  • candidate – Coreferent antecedent Markable object
Returns:

void

modules.xrenner_propagate.propagate_entity(markable, candidate, direction='propagate')[source]

Propagate class and agreement features between coreferent markables

Parameters:
  • markable – a Markable object
  • candidate – a coreferent antecedent Markable object
  • direction – propagation direction; by default, data can be propagated in either direction from the more certain markable to the less certain one, but direction can be forced, e.g. ‘propagate_forward’
Returns:

void
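The propagation logic can be sketched with plain dicts standing in for Markable objects (the key names, the "propagate_back" direction label and the certainty comparison are simplifications):

```python
def propagate_entity_sketch(markable, candidate, direction="propagate"):
    """Copy the entity feature between two markable-like dicts: forced direction
    if given, otherwise from the more certain markable to the less certain one."""
    if direction == "propagate_forward":
        candidate["entity"] = markable["entity"]
    elif direction == "propagate_back":
        markable["entity"] = candidate["entity"]
    elif markable["certainty"] >= candidate["certainty"]:
        candidate["entity"] = markable["entity"]
    else:
        markable["entity"] = candidate["entity"]

mark = {"entity": "person", "certainty": 1.0}
cand = {"entity": "animal", "certainty": 0.2}
propagate_entity_sketch(mark, cand)
# cand now carries the more certain entity, "person"
```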

Unit tests

class modules.xrenner_test.Test1Model(methodName='runTest')[source]
classmethod setUpClass()[source]
classmethod tearDownClass()[source]
test_model_files()[source]
class modules.xrenner_test.Test2MarkableMethods(methodName='runTest')[source]
classmethod setUpClass()[source]
classmethod tearDownClass()[source]
test_atomic_mod()[source]
test_name()[source]
class modules.xrenner_test.Test3CorefMethods(methodName='runTest')[source]
classmethod setUpClass()[source]
test_affix_morphology()[source]
test_appos_envelope()[source]
test_cardinality()[source]
test_dynamic_hasa()[source]
test_entity_dep()[source]
test_hasa()[source]
test_isa()[source]
test_verbal_event_stem()[source]
class modules.xrenner_test.Case(case_string)[source]