What's New in v3.2
spaCy v3.2 adds support for floret vectors, makes custom Doc creation and
scoring easier, and includes many bug
fixes and improvements. For the trained pipelines, there’s a new transformer
pipeline for Japanese and the Universal Dependencies training data has been
updated across the board to the most recent release.
Registered scoring functions
To customize the scoring, you can specify a scoring function for each component
in your config from the new scorers registry:
config.cfg (excerpt)
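A minimal sketch of what such an excerpt could look like, assuming a tagger component and the built-in scorer registered under the name spacy.tagger_scorer.v1. A custom function registered in the scorers registry would be referenced the same way:

```ini
[components.tagger]
factory = "tagger"
# Assumption for illustration: the component's default registered scorer
scorer = {"@scorers":"spacy.tagger_scorer.v1"}
```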
Overwrite settings
Most pipeline components now include an overwrite setting in the config that
determines whether existing annotation in the Doc is preserved or overwritten:
config.cfg (excerpt)
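As a hedged illustration, again assuming a tagger component, the setting sits next to the factory name in the component's config block. With the value below, annotation that is already present on the Doc is preserved:

```ini
[components.tagger]
factory = "tagger"
# false: keep existing annotation; true: replace it with the component's predictions
overwrite = false
```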
Doc input for pipelines
nlp and nlp.pipe now accept Doc input, skipping the tokenizer if a Doc is
provided instead of a string. This makes it easier to create a Doc with custom
tokenization or to set custom extensions before processing.
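Both use cases can be sketched as follows. The example assumes spaCy v3.2 or later and a blank English pipeline; the extension name text_id and the sample words are hypothetical and only serve as illustration.

```python
import spacy
from spacy.tokens import Doc

# Hypothetical custom extension; force=True keeps the snippet re-runnable.
Doc.set_extension("text_id", default=None, force=True)

nlp = spacy.blank("en")

# Case 1: tokenize first, attach custom data, then run the pipeline.
doc = nlp.make_doc("This is text 500.")
doc._.text_id = 500
doc = nlp(doc)  # the input is already a Doc, so the tokenizer is skipped

# Case 2: supply your own tokenization instead of the tokenizer's rules.
words = ["New", "York-based", "startup"]
custom_doc = nlp(Doc(nlp.vocab, words=words))

print(doc._.text_id)
print([token.text for token in custom_doc])
```

The same applies to nlp.pipe, which can be given an iterable of Doc objects instead of strings.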
Support for floret vectors
We recently published floret, an extended version of fastText that combines
fastText’s subwords with Bloom embeddings for compact, full-coverage vectors.
Because vectors are built from subwords, there are no out-of-vocabulary (OOV)
words, and thanks to Bloom embeddings, the vector table can be kept very small
at <100K entries. Bloom embeddings are already used by HashEmbed in tok2vec for
compact spaCy models.
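To make the idea concrete, here is a toy, standard-library-only sketch of how subwords and Bloom-style hashing can combine. It is not floret's implementation: the hash function (seeded blake2b), the table sizes and the randomly filled table are all assumptions chosen for illustration. Each word and character n-gram is hashed to several rows of one small table, an entry's vector is the sum of those rows, and the word vector is the average over its entries.

```python
import hashlib
import random

N_ROWS, DIM, HASH_COUNT = 1000, 4, 2  # toy sizes, far smaller than a real table

# Stand-in for a trained table: fixed pseudo-random values.
rng = random.Random(0)
TABLE = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(N_ROWS)]

def char_ngrams(word, minn=3, maxn=5):
    """Return the boundary-marked word followed by its character n-grams."""
    marked = f"<{word}>"
    grams = [marked]
    for n in range(minn, maxn + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

def bloom_rows(entry):
    """Hash one entry to HASH_COUNT table rows, one row per hash seed."""
    rows = []
    for seed in range(HASH_COUNT):
        digest = hashlib.blake2b(
            entry.encode("utf8"), digest_size=8, salt=bytes([seed])
        ).digest()
        rows.append(int.from_bytes(digest, "little") % N_ROWS)
    return rows

def word_vector(word):
    """Average the entry vectors; each entry vector is the sum of its rows."""
    grams = char_ngrams(word)
    total = [0.0] * DIM
    for gram in grams:
        for row in bloom_rows(gram):
            for d in range(DIM):
                total[d] += TABLE[row][d]
    return [value / len(grams) for value in total]

# Any string maps to rows inside the fixed table, so no word is ever OOV.
vector = word_vector("epäjärjestelmällisyys")
print(len(vector))
```

Collisions between entries are expected in such a small table. Using more than one hash per entry is what keeps them tolerable, since two entries rarely share all of their rows.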
For easy integration, floret includes a Python wrapper.
A demo project shows how to train and import floret vectors.
Two additional demo projects compare standard fastText vectors with floret vectors for full spaCy pipelines. For agglutinative languages like Finnish or Korean, there are large improvements in performance due to the use of subwords (no OOV words!), with a vector table containing merely 50K entries.
Updates for spacy-transformers v1.1
spacy-transformers v1.1 has been refactored to improve serialization and the
support for inline transformer components and for replacing listeners. In
addition, the transformer model output is provided as ModelOutput instead of
tuples in TransformerData.model_output and FullTransformerBatch.model_output.
For backwards compatibility, the tuple format remains available under
TransformerData.tensors and FullTransformerBatch.tensors. See more details in
the transformer API docs.
spacy-transformers v1.1 also adds support for transformer_config settings such
as output_attentions. Additional output is stored under
TransformerData.model_output. More details are in the TransformerModel docs.
The training speed has been improved by streamlining allocations for tokenizer
output, and there is new support for mixed-precision training.
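A hedged sketch of how these settings might appear in a training config, assuming the spacy-transformers.TransformerModel.v3 architecture and roberta-base as a placeholder model name:

```ini
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
# Placeholder model name, chosen only for illustration
name = "roberta-base"
# Opt in to mixed-precision training (off by default)
mixed_precision = true

[components.transformer.model.transformer_config]
# Extra output such as attentions ends up in TransformerData.model_output
output_attentions = true
```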
New transformer package for Japanese
spaCy v3.2 adds a new transformer pipeline package for Japanese,
ja_core_news_trf, which uses the basic pretokenizer instead of mecab to limit
the number of dependencies required for the pipeline. Thanks to Hiroshi Matsuda
and the spaCy Japanese community for their contributions!
Pipeline and language updates
- All Universal Dependencies training data has been updated to v2.8.
- The Catalan data, tokenizer and lemmatizer have been updated, thanks to Carlos Rodriguez, Carme Armentano and the Barcelona Supercomputing Center!
- The transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
- Trailing whitespace has been added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
- The English attribute ruler patterns have been overhauled to improve Token.pos and Token.morph.
spaCy v3.2 also features a new Irish lemmatizer, support for noun_chunks in
Portuguese, improved noun_chunks for Spanish and additional updates for
Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
Notes about upgrading from v3.1
Pipeline package version compatibility
When you’re loading a pipeline package trained with spaCy v3.0 or v3.1, you will
see a warning telling you that the pipeline may be incompatible. This doesn’t
necessarily have to be true, but we recommend running your pipelines against
your test suite or evaluation data to make sure there are no unexpected results.
If you’re using one of the trained pipelines we provide, you should run
spacy download to update to the latest version. To see an overview of all
installed packages and their compatibility, you can run spacy validate.
If you’ve trained your own custom pipeline and you’ve confirmed that it’s still
working as expected, you can update the spaCy version requirements in the
pipeline’s meta.json.
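The edit itself is a one-field change. The sketch below applies it with the standard library to a throwaway file; the helper name set_spacy_version, the pipeline name and the exact version ranges are illustrative assumptions, so adjust them to match your own package.

```python
import json
import tempfile
from pathlib import Path

def set_spacy_version(meta_path, version_range):
    """Rewrite the "spacy_version" field of a meta.json file in place."""
    meta = json.loads(meta_path.read_text(encoding="utf8"))
    meta["spacy_version"] = version_range
    meta_path.write_text(json.dumps(meta, indent=2), encoding="utf8")
    return meta

# Demonstrate on a temporary file; a real meta.json has many more fields.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "meta.json"
    original = {"name": "my_pipeline", "spacy_version": ">=3.1.0,<3.2.0"}
    meta_path.write_text(json.dumps(original), encoding="utf8")
    updated = set_spacy_version(meta_path, ">=3.2.0,<3.3.0")
    reloaded = json.loads(meta_path.read_text(encoding="utf8"))

print(updated["spacy_version"])
```

All other fields are left untouched, so the pipeline's name, components and performance metadata survive the update.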
Updating v3.1 configs
To update a config from spaCy v3.1 with the new v3.2 settings, run
spacy init fill-config on the existing config and write the result to a new
file.
In many cases (spacy train, spacy.load), the new defaults will be filled in
automatically, but you’ll need to fill in the new settings to run debug config
and debug data.
Notes about upgrading from spacy-transformers v1.0
When you’re loading a transformer pipeline package trained with
spacy-transformers v1.0 after upgrading to spacy-transformers v1.1, you’ll see
a warning telling you that the pipeline may be incompatible.
spacy-transformers v1.1 should be able to import v1.0 transformer components
into the new internal format with no change in performance, but here we’d also
recommend running your test suite to verify that the pipeline still performs as
expected.
If you save your pipeline with nlp.to_disk, it will be saved in the new v1.1
format and should be fully compatible with spacy-transformers v1.1. Once you’ve
confirmed the performance, you can update the requirements in meta.json.
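As a rough sketch, the requirement entry can be swapped without touching the rest of the metadata. The helper names and the version ranges below are assumptions for illustration, so check the release you actually installed before copying them.

```python
import json
import re

def requirement_name(requirement):
    """Return the package name in front of the first version operator."""
    return re.split(r"[<>=!~\s]", requirement, maxsplit=1)[0]

def bump_requirement(meta, package, version_spec):
    """Return a copy of `meta` with the version range for `package` replaced."""
    updated = dict(meta)
    updated["requirements"] = [
        package + version_spec if requirement_name(req) == package else req
        for req in meta.get("requirements", [])
    ]
    return updated

# Illustrative metadata; a real meta.json contains many more fields.
meta = {
    "name": "my_trf_pipeline",
    "requirements": ["spacy-transformers>=1.0.3,<1.1.0"],
}
updated = bump_requirement(meta, "spacy-transformers", ">=1.1.2,<1.2.0")
print(json.dumps(updated, indent=2))
```

The function returns a new dictionary rather than editing in place, which makes it easy to compare the old and new requirements before writing the file back.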
If you’re using one of the trained pipelines we provide, you should run
spacy download to update to the latest version. To see an overview of all
installed packages and their compatibility, you can run spacy validate.