Sentencizer

class

String name: sentencizer
Trainable: no

Pipeline component for rule-based sentence boundary detection

A simple pipeline component that provides rule-based sentence boundary detection without requiring the dependency parse. By default, sentence segmentation is performed by the DependencyParser. The Sentencizer offers a simpler, punctuation-based strategy that works without loading a statistical model.

Assigned Attributes

Calculated values will be assigned to Token.is_sent_start. The resulting sentences can be accessed using Doc.sents.

Location	Value	Type
`Token.is_sent_start`	A boolean value indicating whether the token starts a sentence. This will be either `True` or `False` for all tokens.	bool
`Doc.sents`	An iterator over sentences in the `Doc`, determined by `Token.is_sent_start` values.	Iterator[Span]
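
As a minimal sketch of how these attributes are typically read, assuming a blank English pipeline created with `spacy.blank("en")` (the example text and variable names are illustrative only):

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical components.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("This is a sentence. This is another sentence.")

# Tokens that the sentencizer marked as starting a sentence.
starts = [token.text for token in doc if token.is_sent_start]

# Sentences are exposed as Span objects via Doc.sents.
sentences = [sent.text for sent in doc.sents]

print(starts)
print(sentences)
```

Every token gets an explicit `True` or `False` value here, so `Doc.sents` can be iterated without a parser in the pipeline.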

Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training.

Setting	Description	Type
`punct_chars`	Optional custom list of punctuation characters that mark sentence ends. See below for defaults if not set. Defaults to `None`.	Optional[List[str]]
`overwrite` v3.2	Whether existing annotation is overwritten. Defaults to `False`.	bool
`scorer` v3.2	The scoring method. Defaults to `Scorer.score_spans` for the attribute `"sents"`.	Optional[Callable]
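
The settings above can be passed through the `config` argument of `nlp.add_pipe`. The following is a sketch under the assumption that only `;` should end a sentence; the choice of character and the sample text are made up for illustration:

```python
import spacy

nlp = spacy.blank("en")

# Override the default punctuation characters: only ";" ends a sentence here.
nlp.add_pipe("sentencizer", config={"punct_chars": [";"]})

doc = nlp("Hello world; this is fine. Still the same sentence")
sentences = [sent.text for sent in doc.sents]

for sentence in sentences:
    print(sentence)
```

Because `.` is no longer in `punct_chars`, the period after "fine" does not introduce a sentence boundary in this sketch.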

explosion/spaCy/master/spacy/pipeline/sentencizer.pyx

Sentencizer.__init__ method

Initialize the sentencizer.

Name	Description	Type
keyword-only
`punct_chars`	Optional custom list of punctuation characters that mark sentence ends. See below for defaults.	Optional[List[str]]
`overwrite` v3.2	Whether existing annotation is overwritten. Defaults to `False`.	bool
`scorer` v3.2	The scoring method. Defaults to `Scorer.score_spans` for the attribute `"sents"`.	Optional[Callable]
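
To illustrate the `overwrite` argument, here is a hedged sketch that constructs the component directly and applies it to a `Doc` that already carries sentence annotation. It assumes spaCy v3.2 or later (where `overwrite` is available); the `make_doc` helper and the word list are hypothetical and exist only for this example:

```python
import spacy
from spacy.pipeline import Sentencizer
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Hello", "world", ".", "Bye", "."]


def make_doc():
    # Hypothetical helper: a Doc pre-annotated as one single sentence.
    return Doc(nlp.vocab, words=words, sent_starts=[True, False, False, False, False])


# overwrite=False (the default) keeps the existing boundaries.
kept = Sentencizer(overwrite=False)(make_doc())
# overwrite=True replaces them with the rule-based prediction.
replaced = Sentencizer(overwrite=True)(make_doc())

kept_count = len(list(kept.sents))
replaced_count = len(list(replaced.sents))
print(kept_count, replaced_count)
```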

punct_chars defaults

Sentencizer.__call__ method

Apply the sentencizer on a Doc. Typically, this happens automatically after the component has been added to the pipeline using nlp.add_pipe.

Name	Description	Type
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.	Doc
RETURNS	The modified `Doc` with added sentence boundaries.	Doc
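
Calling the component directly is rarely needed, but a small sketch may make the behavior concrete. It assumes an `English()` pipeline with no components, so the `Doc` has no sentence boundaries until the sentencizer is applied by hand:

```python
from spacy.lang.en import English
from spacy.pipeline import Sentencizer

nlp = English()  # tokenizer only, nothing sets sentence boundaries
doc = nlp("This is a sentence. This is another sentence.")

# No sentence start annotation exists yet.
had_boundaries = doc.has_annotation("SENT_START")

# Apply the component manually; it returns the same Doc, modified in place.
sentencizer = Sentencizer()
doc = sentencizer(doc)

print(had_boundaries, doc.has_annotation("SENT_START"))
print([sent.text for sent in doc.sents])
```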

Sentencizer.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.

Name	Description	Type
`stream`	A stream of documents.	Iterable[Doc]
keyword-only
`batch_size`	The number of documents to buffer. Defaults to `128`.	int
YIELDS	The processed documents in order.	Doc
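
A minimal sketch of calling `pipe` by hand, assuming the component instance returned by `nlp.add_pipe` and `Doc` objects created with `nlp.make_doc` (the texts and the `batch_size` value are arbitrary):

```python
import spacy

nlp = spacy.blank("en")
sentencizer = nlp.add_pipe("sentencizer")  # add_pipe returns the component

texts = ["First. Second.", "Only one sentence here", "Really? Yes! Done."]

# Tokenize only, then stream the unprocessed Docs through the component.
docs = (nlp.make_doc(text) for text in texts)
counts = [len(list(doc.sents)) for doc in sentencizer.pipe(docs, batch_size=50)]

print(counts)
```

In normal use, `nlp.pipe(texts)` does the same work for every component in the pipeline.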

Sentencizer.to_disk method

Save the sentencizer settings (punctuation characters) to a directory. Will create a file sentencizer.json. This also happens automatically when you save an nlp object with a sentencizer added to its pipeline.

Name	Description	Type
`path`	A path to a JSON file, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects.	Union[str,Path]
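
As an illustration, the sketch below saves a sentencizer to a temporary location and inspects the resulting JSON. The temporary directory and the file name are assumptions made for the example, and only the `punct_chars` key is read back, since other keys may vary between versions:

```python
import json
import tempfile
from pathlib import Path

from spacy.pipeline import Sentencizer

sentencizer = Sentencizer(punct_chars=[".", "?", "!"])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentencizer.json"
    sentencizer.to_disk(path)
    # Read the file back with the standard library to see what was stored.
    saved = json.loads(path.read_text(encoding="utf8"))

saved_chars = set(saved["punct_chars"])
print(sorted(saved_chars))
```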

Sentencizer.from_disk method

Load the sentencizer settings from a file. Expects a JSON file. This also happens automatically when you load an nlp object or model with a sentencizer added to its pipeline.

Name	Description	Type
`path`	A path to a JSON file. Paths may be either strings or `Path`-like objects.	Union[str,Path]
RETURNS	The modified `Sentencizer` object.	Sentencizer
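
A round trip can be sketched as follows, assuming a temporary file path chosen for the example. A sentencizer with custom punctuation is saved, and a fresh instance with default settings loads those settings back:

```python
import tempfile
from pathlib import Path

from spacy.pipeline import Sentencizer

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sentencizer.json"

    # Save custom settings, then load them into a new default instance.
    Sentencizer(punct_chars=[".", "。"]).to_disk(path)
    restored = Sentencizer().from_disk(path)

# from_disk returns the modified component itself.
print(sorted(restored.punct_chars))
```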