Memory Management
spaCy maintains a few internal caches that improve speed but cause memory usage to increase slightly over time. If you’re running a batch process that doesn’t need to be long-lived, the increase in memory usage generally isn’t a problem. However, if you’re running spaCy inside a web service, you’ll often want its memory usage to stay consistent. Transformer models can also run into memory problems, especially when used on a GPU.
Memory zones
You can tell spaCy to free data from its internal caches (especially the Vocab) using the Language.memory_zone context manager. Enter the context manager and process your text within it, and spaCy will reset its internal caches (freeing the associated memory) at the end of the block. spaCy objects created inside the memory zone must not be accessed once the memory zone is finished.
Using memory zones
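The pattern above can be sketched as follows. This is a minimal example assuming spaCy v3.8+ (where Language.memory_zone was introduced); it uses a blank English pipeline so it stays self-contained, whereas in practice you would load a trained model. Note that only plain Python values (here, strings copied out via token.text) leave the zone, never the Doc objects themselves:

```python
from collections import Counter

import spacy

# A blank pipeline keeps the example self-contained; in practice you would
# load a trained model instead.
nlp = spacy.blank("en")


def count_words(nlp, texts):
    counts = Counter()
    # Vocab entries created inside this block are freed when it exits.
    # The Doc objects created here must not be used after the block ends.
    with nlp.memory_zone():
        for doc in nlp.pipe(texts):
            for token in doc:
                counts[token.text] += 1
    return counts


word_counts = count_words(nlp, ["the cat sat", "the dog sat"])
```

Because token.text returns an ordinary Python string, the counts remain valid after the memory zone closes, while the cached Vocab data backing the Doc objects is released.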
spaCy needs the memory zone context manager because the processing pipeline can’t keep track of which Doc objects are referring to data in the shared Vocab cache. For instance, when spaCy encounters a new word, a new Lexeme entry is stored in the Vocab, and the Doc object points to this shared data. When the Doc goes out of scope, the Vocab has no way of knowing that this Lexeme is no longer in use.
The memory zone solves this problem by allowing you to tell the processing pipeline that all data created between two points is no longer in use. It is up to you to honor this agreement. If you access objects that are supposed to no longer be in use, you may encounter a segmentation fault due to invalid memory access.
A common use case for memory zones will be within a web service. The processing pipeline can be loaded once, either as a context variable or a global, and each request can be handled within a memory zone:
Memory zones with FastAPI
Clearing transformer tensors and other Doc attributes
The Transformer and Tok2Vec components set intermediate values onto the Doc object during processing. This can cause GPU memory to be exhausted if many Doc objects are kept in memory together.
To resolve this, you can add the doc_cleaner component to your pipeline. By default this will clean up the Doc._.trf_data extension attribute and the Doc.tensor attribute. You can have it clean up other intermediate extension attributes you use in custom pipeline components as well.
Adding the doc_cleaner
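A minimal sketch of adding the component. A blank English pipeline is used here so the example is self-contained; in a real transformer pipeline the cleaner would sit at the end, after the components that consume the intermediate values. The attrs mapping pairs each attribute to clean with the value it should be reset to:

```python
import spacy

nlp = spacy.blank("en")
# Add the built-in doc_cleaner component. Here only Doc.tensor is reset;
# the default config also covers the Doc._.trf_data extension attribute.
nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})

doc = nlp("This is a sentence.")
# After the pipeline runs, doc.tensor has been reset, freeing the memory
# the intermediate representation would otherwise hold on to.
```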