Sentencizer
A simple pipeline component that allows custom sentence boundary detection logic
that doesn’t require the dependency parse. By default, sentence segmentation is
performed by the DependencyParser, so the Sentencizer lets you implement a
simpler, rule-based strategy that doesn’t require a statistical model to be
loaded.
Assigned Attributes
Calculated values will be assigned to Token.is_sent_start. The resulting
sentences can be accessed using Doc.sents.
Location | Value | Type |
---|---|---|
Token.is_sent_start | A boolean value indicating whether the token starts a sentence. This will be either True or False for all tokens. | bool |
Doc.sents | An iterator over sentences in the Doc, determined by Token.is_sent_start values. | Iterator[Span] |
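As a minimal sketch of how these attributes can be inspected, the example below adds the component to a blank English pipeline, so no trained model is loaded. The sample text and the variable names `starts` and `sentences` are illustrative only.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another sentence.")

# One boolean per token: True where a sentence begins.
starts = [token.is_sent_start for token in doc]

# Doc.sents yields Span objects derived from those values.
sentences = [sent.text for sent in doc.sents]
print(sentences)
```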
Config and implementation
The default config is defined by the pipeline component factory and describes
how the component should be configured. You can override its settings via the
config argument on nlp.add_pipe or in your config.cfg for training.
Setting | Description | Type |
---|---|---|
punct_chars | Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to None. | Optional[List[str]] |
overwrite (v3.2) | Whether existing annotation is overwritten. Defaults to False. | bool |
scorer (v3.2) | The scoring method. Defaults to Scorer.score_spans for the attribute "sents". | Optional[Callable] |
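A short sketch of overriding a setting through the config argument on nlp.add_pipe. Here a semicolon is assumed to be the only sentence-ending character, purely to make the effect visible; the sample text is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")

# Override the default punctuation set: only ";" ends a sentence here.
config = {"punct_chars": [";"]}
nlp.add_pipe("sentencizer", config=config)

doc = nlp("First clause; second clause. Still the second")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```

Because the period is no longer in punct_chars, the text after the semicolon stays together as a single sentence.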
explosion/spaCy/master/spacy/pipeline/sentencizer.pyx
Sentencizer.__init__ method
Initialize the sentencizer.
Name | Description | Type |
---|---|---|
keyword-only | | |
punct_chars | Optional custom list of punctuation characters that mark sentence ends. See below for defaults. | Optional[List[str]] |
overwrite (v3.2) | Whether existing annotation is overwritten. Defaults to False. | bool |
scorer (v3.2) | The scoring method. Defaults to Scorer.score_spans for the attribute "sents". | Optional[Callable] |
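The component can also be constructed directly from its class instead of through nlp.add_pipe. The sketch below assumes the instance exposes its active characters as a `punct_chars` attribute; the variable names are illustrative.

```python
from spacy.pipeline import Sentencizer

# Default construction uses the built-in punctuation set.
default_sentencizer = Sentencizer()

# Keyword-only argument: restrict sentence ends to "!" and "?".
custom_sentencizer = Sentencizer(punct_chars=["!", "?"])

print(sorted(custom_sentencizer.punct_chars))
```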
punct_chars defaults
Sentencizer.__call__ method
Apply the sentencizer on a Doc. Typically, this happens automatically after
the component has been added to the pipeline using nlp.add_pipe.
Name | Description | Type |
---|---|---|
doc | The Doc object to process, e.g. the Doc in the pipeline. | Doc |
RETURNS | The modified Doc with added sentence boundaries. | Doc |
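A sketch of calling the component by hand on a tokenized Doc, outside of any pipeline. The blank English pipeline is used here only for its tokenizer, and the sample text is illustrative.

```python
import spacy
from spacy.pipeline import Sentencizer

nlp = spacy.blank("en")      # used only to tokenize the text
sentencizer = Sentencizer()  # standalone component, not added to nlp

doc = nlp.make_doc("Hello world. Goodbye world.")

# Calling the component sets sentence boundaries on the Doc and returns it.
processed = sentencizer(doc)
sentences = [sent.text for sent in processed.sents]
print(sentences)
```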
Sentencizer.pipe method
Apply the pipe to a stream of documents. This usually happens under the hood
when the nlp object is called on a text and all pipeline components are
applied to the Doc in order.
Name | Description | Type |
---|---|---|
stream | A stream of documents. | Iterable[Doc] |
keyword-only | | |
batch_size | The number of documents to buffer. Defaults to 128. | int |
YIELDS | The processed documents in order. | Doc |
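A sketch of streaming several Doc objects through the component directly. The texts and the `counts` variable are illustrative, and the small batch_size is chosen arbitrarily.

```python
import spacy

nlp = spacy.blank("en")
sentencizer = nlp.add_pipe("sentencizer")

texts = ["It rains. It pours.", "We wait. We watch. We leave."]

# Tokenize only; the sentencizer is applied by pipe() below.
docs = (nlp.make_doc(text) for text in texts)

# pipe() yields the processed documents in their original order.
counts = [len(list(doc.sents)) for doc in sentencizer.pipe(docs, batch_size=2)]
print(counts)
```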
Sentencizer.to_disk method
Save the sentencizer settings (punctuation characters) to a directory. Will
create a file sentencizer.json. This also happens automatically when you save
an nlp object with a sentencizer added to its pipeline.
Name | Description | Type |
---|---|---|
path | A path to a JSON file, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects. | Union[str,Path] |
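A sketch of saving the settings to a temporary location and reading the file back as plain JSON. It assumes the punctuation characters are stored under a `punct_chars` key; the temporary directory and variable names are illustrative.

```python
import json
import tempfile
from pathlib import Path

from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=[".", "!"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentencizer.json"
    sentencizer.to_disk(path)

    # Inspect what was written while the temporary directory still exists.
    file_exists = path.exists()
    saved = json.loads(path.read_text(encoding="utf8"))

print(saved)
```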
Sentencizer.from_disk method
Load the sentencizer settings from a file. Expects a JSON file. This also
happens automatically when you load an nlp object or model with a sentencizer
added to its pipeline.
Name | Description | Type |
---|---|---|
path | A path to a JSON file. Paths may be either strings or Path-like objects. | Union[str,Path] |
RETURNS | The modified Sentencizer object. | Sentencizer |
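A round-trip sketch: one component's settings are written to a JSON file and loaded into a second, default-configured component. The temporary path and the names `source` and `restored` are illustrative.

```python
import tempfile
from pathlib import Path

from spacy.pipeline import Sentencizer

source = Sentencizer(punct_chars=["!"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentencizer.json"
    source.to_disk(path)

    # from_disk modifies the receiving object and returns it.
    target = Sentencizer()
    restored = target.from_disk(path)

print(sorted(restored.punct_chars))
```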
Sentencizer.to_bytes method
Serialize the sentencizer settings to a bytestring.
Name | Description | Type |
---|---|---|
RETURNS | The serialized data. | bytes |
Sentencizer.from_bytes method
Load the pipe from a bytestring. Modifies the object in place and returns it.
Name | Description | Type |
---|---|---|
bytes_data | The bytestring to load. | bytes |
RETURNS | The modified Sentencizer object. | Sentencizer |
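A sketch of the in-memory equivalent: settings are serialized with to_bytes and restored into a fresh component with from_bytes. The variable names are illustrative.

```python
from spacy.pipeline import Sentencizer

source = Sentencizer(punct_chars=["?", "!"])
data = source.to_bytes()

# Load into a default component; the object is modified in place and returned.
target = Sentencizer()
restored = target.from_bytes(data)

print(type(data).__name__, sorted(restored.punct_chars))
```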