PhraseMatcher
The PhraseMatcher lets you efficiently match large terminology lists. While the Matcher lets you match sequences based on lists of token descriptions, the PhraseMatcher accepts match patterns in the form of Doc objects. See the usage guide for examples.
PhraseMatcher.__init__ method
Create the rule-based PhraseMatcher. Setting a different attr to match on will change the token attributes that are compared to determine a match. By default, the incoming Doc is checked for sequences of tokens with the same ORTH value, i.e. the verbatim token text. Matching on the attribute LOWER will result in case-insensitive matching, since only the lowercase token texts are compared. In theory, it’s also possible to match on sequences of the same part-of-speech tags or dependency labels.
If validate=True is set, additional validation is performed when patterns are added. At the moment, it will check whether a Doc has attributes assigned that aren’t necessary to produce the matches (for example, part-of-speech tags if the PhraseMatcher matches on the token text). Since computing such attributes can often lead to significantly worse performance when creating the pattern, a UserWarning will be shown.
Name | Description |
---|---|
vocab | The vocabulary object, which must be shared with the documents the matcher will operate on. Vocab |
attr | The token attribute to match on. Defaults to ORTH, i.e. the verbatim token text. Union[int, str] |
validate | Validate patterns added to the matcher. bool |
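The effect of attr can be illustrated with a minimal sketch, assuming spaCy v3 and a blank English pipeline created with spacy.blank("en"), so no trained model is needed. The key name "PERSON" and the example texts are illustrative only.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # tokenizer-only pipeline, assumed for this sketch

# Default: compare the verbatim token text (ORTH), so case matters.
exact = PhraseMatcher(nlp.vocab)
exact.add("PERSON", [nlp.make_doc("Barack Obama")])

# attr="LOWER": compare lowercase token texts, so case is ignored.
loose = PhraseMatcher(nlp.vocab, attr="LOWER")
loose.add("PERSON", [nlp.make_doc("Barack Obama")])

doc = nlp.make_doc("barack obama and Barack Obama")
exact_spans = sorted((start, end) for _, start, end in exact(doc))
loose_spans = sorted((start, end) for _, start, end in loose(doc))

print(exact_spans)  # [(3, 5)]
print(loose_spans)  # [(0, 2), (3, 5)]
```

Patterns are built with nlp.make_doc here, which only runs the tokenizer. That avoids assigning attributes the matcher does not need, which is the situation validate=True warns about.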
PhraseMatcher.__call__ method
Find all token sequences matching the supplied patterns on the Doc or Span.
Name | Description |
---|---|
doclike | The Doc or Span to match over. Union[Doc,Span] |
keyword-only | |
as_spans v3.0 | Instead of tuples, return a list of Span objects of the matches, with the match_id assigned as the span label. Defaults to False. bool |
RETURNS | A list of (match_id, start, end) tuples, describing the matches. A match tuple describes a span doc[start:end]. The match_id is the ID of the added match pattern. If as_spans is set to True, a list of Span objects is returned instead. Union[List[Tuple[int, int, int]], List[Span]] |
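As a sketch of the two return shapes, the following assumes spaCy v3 and a blank English pipeline. The key "CITY" and the sentence are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("CITY", [nlp.make_doc("New York"), nlp.make_doc("Berlin")])

doc = nlp.make_doc("She moved from Berlin to New York last year.")

# Default: a list of (match_id, start, end) tuples.
matches = matcher(doc)
matched_texts = sorted(doc[start:end].text for _, start, end in matches)
# The integer match_id maps back to the string key via the vocab's string store.
keys = {nlp.vocab.strings[match_id] for match_id, _, _ in matches}

# as_spans=True: Span objects whose label is the match ID.
spans = matcher(doc, as_spans=True)
span_labels = {span.label_ for span in spans}

print(matched_texts)  # ['Berlin', 'New York']
print(keys, span_labels)
```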
PhraseMatcher.__len__ method
Get the number of rules added to the matcher. Note that this only returns the number of rules (identical with the number of IDs), not the number of individual patterns.
Name | Description |
---|---|
RETURNS | The number of rules. int |
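The difference between rules and patterns can be seen in a short sketch, again assuming spaCy v3 and a blank English pipeline, with illustrative keys.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
empty_len = len(matcher)

# Two patterns under one key still count as a single rule.
matcher.add("CITY", [nlp.make_doc("New York"), nlp.make_doc("Berlin")])
one_rule_len = len(matcher)

# A second key adds a second rule.
matcher.add("PERSON", [nlp.make_doc("Barack Obama")])
two_rules_len = len(matcher)

print(empty_len, one_rule_len, two_rules_len)  # 0 1 2
```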
PhraseMatcher.__contains__ method
Check whether the matcher contains rules for a match ID.
Name | Description |
---|---|
key | The match ID. str |
RETURNS | Whether the matcher contains rules for this match ID. bool |
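A minimal membership check might look like this, assuming spaCy v3 and a blank English pipeline. The key "OBAMA" is an example name.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)

before = "OBAMA" in matcher  # no rule added yet
matcher.add("OBAMA", [nlp.make_doc("Barack Obama")])
after = "OBAMA" in matcher   # the string key is now registered

print(before, after)  # False True
```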
PhraseMatcher.add method
Add a rule to the matcher, consisting of an ID key, one or more patterns, and an optional callback function to act on the matches. The callback function will receive the arguments matcher, doc, i and matches. If patterns already exist for the given ID, the new patterns are added to them, and any existing on_match callback for that ID is overwritten.
Name | Description |
---|---|
key | An ID for the thing you’re matching. str |
docs | Doc objects of the phrases to match. List[Doc] |
keyword-only | |
on_match | Callback function to act on matches. Takes the arguments matcher, doc, i and matches. Optional[Callable[[Matcher, Doc, int, List[tuple]], Any]] |
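The following sketch shows a callback and the extension of an existing rule, assuming spaCy v3 and a blank English pipeline. The key "HEALTH", the callback name collect and the list found are hypothetical names chosen for this example.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
found = []

def collect(matcher, doc, i, matches):
    # Called once per match; i indexes the current match in matches.
    match_id, start, end = matches[i]
    found.append(doc[start:end].text)

matcher = PhraseMatcher(nlp.vocab)
matcher.add("HEALTH", [nlp.make_doc("health care reform")], on_match=collect)
# Same key again: the new pattern is added to the existing rule.
# The callback is passed again because the previous one would be overwritten.
matcher.add("HEALTH", [nlp.make_doc("healthcare reform")], on_match=collect)

doc = nlp.make_doc("The health care reform and the healthcare reform both passed.")
matches = matcher(doc)

print(len(matcher))   # 1
print(sorted(found))  # ['health care reform', 'healthcare reform']
```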
PhraseMatcher.remove method
Remove a rule from the matcher by match ID. A KeyError is raised if the key does not exist.
Name | Description |
---|---|
key | The ID of the match rule. str |
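Removal and the error case can be sketched as follows, assuming spaCy v3 and a blank English pipeline. The key "OBAMA" is an example name.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", [nlp.make_doc("Barack Obama")])

present_before = "OBAMA" in matcher
matcher.remove("OBAMA")
present_after = "OBAMA" in matcher

# Removing a key that is no longer present raises KeyError.
raised_key_error = False
try:
    matcher.remove("OBAMA")
except KeyError:
    raised_key_error = True

print(present_before, present_after, raised_key_error)  # True False True
```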