Corpus
This class manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. To customize the data loading during training, you can register your own data readers and batchers. Also see the usage guide on data utilities for more details and examples.
Config and implementation
spacy.Corpus.v1 is a registered function that creates a Corpus of training or evaluation data. It takes the same arguments as the Corpus class and returns a callable that yields Example objects. You can replace it with your own registered function in the @readers registry to customize the data loading and streaming.
Name | Description |
---|---|
path | The directory or filename to read from. Expects data in spaCy’s binary .spacy format. Path |
gold_preproc | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See Corpus for details. bool |
max_length | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. int |
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. int |
augmenter | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don’t have smart quotes, or only have smart quotes, etc. Defaults to None . Optional[Callable] |
explosion/spaCy/master/spacy/training/corpus.py
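As an illustration of replacing the built-in reader, the following is a minimal sketch of a custom function registered in the `@readers` registry. The registry name `inline_texts.v1`, the `texts` argument and the function names are hypothetical and chosen only for this example. The sketch assumes that a reader only needs to return a callable that takes the `nlp` object and yields `Example` objects, as described above.

```python
from typing import Callable, Iterable, List

import spacy
from spacy.language import Language
from spacy.training import Example


# "inline_texts.v1" and its "texts" argument are hypothetical names for this sketch.
@spacy.registry.readers("inline_texts.v1")
def create_inline_reader(texts: List[str]) -> Callable[[Language], Iterable[Example]]:
    def read_examples(nlp: Language) -> Iterable[Example]:
        for text in texts:
            doc = nlp.make_doc(text)
            # No gold annotations here: the reference is a copy of the tokenized text.
            yield Example(doc, doc.copy())

    return read_examples


nlp = spacy.blank("en")
# Look the function up by name, the same way a config would resolve it.
reader = spacy.registry.readers.get("inline_texts.v1")(texts=["Hello world", "Second text"])
examples = list(reader(nlp))
```

In a training config, such a function would be referenced with `@readers = "inline_texts.v1"` in place of `spacy.Corpus.v1`, and the file defining it would need to be importable during training, for example via the `--code` argument of `spacy train`.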
Corpus.__init__ method
Create a Corpus for iterating Example objects from a file or directory of .spacy data files. The gold_preproc setting lets you specify whether to set up the Example object with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due to train/test skew.
Name | Description |
---|---|
path | The directory or filename to read from. Union[str,Path] |
keyword-only | |
gold_preproc | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to False . bool |
max_length | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. int |
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. int |
augmenter | Optional data augmentation callback. Callable[[Language,Example], Iterable[Example]] |
shuffle | Whether to shuffle the examples. Defaults to False . bool |
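To make the effect of `gold_preproc` more concrete, here is a small sketch that writes one hand-tokenized document to a temporary `.spacy` file and loads it twice, once with the default settings and once with `gold_preproc=True`. The file name and the sample sentence are made up for this example, and the comparison assumes the blank English pipeline's tokenizer splits contractions such as "can't".

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")

# Gold tokenization that keeps "can't" as a single token.
gold_doc = Doc(nlp.vocab, words=["I", "can't", "go"], spaces=[True, True, False])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"  # hypothetical file name
    DocBin(docs=[gold_doc]).to_disk(path)

    default_corpus = Corpus(path)
    gold_corpus = Corpus(path, gold_preproc=True)

    # Read while the temporary file still exists.
    default_eg = next(iter(default_corpus(nlp)))
    gold_eg = next(iter(gold_corpus(nlp)))
    default_tokens = [token.text for token in default_eg.predicted]
    gold_tokens = [token.text for token in gold_eg.predicted]

print(default_tokens)  # predicted doc tokenized by nlp.make_doc
print(gold_tokens)     # predicted doc reuses the gold tokens
```

With `gold_preproc=True` the predicted document should carry the same tokens as the reference, whereas the default re-tokenizes the raw text, so the two token lists can differ in length.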
Corpus.__call__ method
Yield examples from the data.
Name | Description |
---|---|
nlp | The current nlp object. Language |
YIELDS | The examples. Example |
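The following sketch shows what calling a `Corpus` might look like in practice. It assumes a small, made-up dataset with entity annotations, written to a temporary file, and uses `limit=2` to read only a subset. Each yielded `Example` pairs an unannotated `predicted` document with the annotated `reference` document.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span
from spacy.training import Corpus

nlp = spacy.blank("en")

# Made-up sample data for this sketch.
texts = ["Apple opened a store", "Google hired engineers", "Nothing to label here"]
docs = [nlp.make_doc(text) for text in texts]
# Mark the first token of the first two docs as an ORG entity.
docs[0].ents = [Span(docs[0], 0, 1, label="ORG")]
docs[1].ents = [Span(docs[1], 0, 1, label="ORG")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "dev.spacy"  # hypothetical file name
    DocBin(docs=docs).to_disk(path)

    corpus = Corpus(path, limit=2)
    first_pass = list(corpus(nlp))
    # The corpus is a callable, so it can be called again for another pass.
    second_pass = list(corpus(nlp))

reference_labels = [[ent.label_ for ent in eg.reference.ents] for eg in first_pass]
predicted_labels = [[ent.label_ for ent in eg.predicted.ents] for eg in first_pass]
```

Because the generator reads from disk lazily, the examples are collected into lists before the temporary directory is removed.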
JsonlCorpus class
Iterate Doc objects from a file or directory of JSONL (newline-delimited JSON) formatted raw text files. Can be used to read the raw text corpus for language model pretraining from a JSONL file.
Example
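A minimal sketch of reading raw text with `JsonlCorpus`, assuming a small JSONL file created on the fly. The file name and records are invented for this example, and `min_length=2` is set only to show that shorter documents are skipped.

```python
import json
import tempfile
from pathlib import Path

import spacy
from spacy.training import JsonlCorpus

nlp = spacy.blank("en")

# Made-up records; each one needs a "text" key.
records = [{"text": "Hi"}, {"text": "This line has five tokens"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "raw_text.jsonl"  # hypothetical file name
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Skip documents with fewer than two tokens.
    corpus = JsonlCorpus(path, min_length=2)
    texts = [eg.reference.text for eg in corpus(nlp)]

print(texts)
```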
JsonlCorpus.__init__ method
Initialize the reader.
Name | Description |
---|---|
path | The directory or filename to read from. Expects newline-delimited JSON with a key "text" for each record. Union[str,Path] |
keyword-only | |
min_length | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to 0 , which indicates no limit. int |
max_length | Maximum document length (in tokens). Longer documents will be skipped. Defaults to 0 , which indicates no limit. int |
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. int |
JsonlCorpus.__call__ method
Yield examples from the data.
Name | Description |
---|---|
nlp | The current nlp object. Language |
YIELDS | The examples. Example |
PlainTextCorpus class v3.5.1
Iterate over documents from a plain text file. Can be used to read the raw text corpus for language model pretraining. The expected file format is:
- UTF-8 encoding
- One document per line
- Blank lines are ignored.
Example
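A minimal sketch of reading a plain text file with `PlainTextCorpus`, assuming spaCy v3.5.1 or later. The file name and contents are invented for this example. The file contains a blank line to show that it does not produce a document.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training import PlainTextCorpus

nlp = spacy.blank("en")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "raw_text.txt"  # hypothetical file name
    # One document per line, with a blank line in between.
    path.write_text("First document.\n\nSecond document.\n", encoding="utf-8")

    corpus = PlainTextCorpus(path)
    texts = [eg.reference.text for eg in corpus(nlp)]

print(texts)
```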
PlainTextCorpus.__init__ method
Initialize the reader.
Name | Description |
---|---|
path | The directory or filename to read from. Expects newline-delimited documents in UTF-8 format. Union[str,Path] |
keyword-only | |
min_length | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to 0 , which indicates no limit. int |
max_length | Maximum document length (in tokens). Longer documents will be skipped. Defaults to 0 , which indicates no limit. int |
PlainTextCorpus.__call__ method
Yield examples from the data.
Name | Description |
---|---|
nlp | The current nlp object. Language |
YIELDS | The examples. Example |