spaCy Curated Transformers
spacy-curated-transformers provides curated, efficient transformer models designed for integration into spaCy processing pipelines. It wraps the `curated-transformers` library, offering specialized components and utilities for tasks like WordPiece tokenization and transformer-based embeddings exposed on spaCy's `Doc` and `Span` objects. The library is actively maintained by Explosion, with a focus on compatibility with the latest spaCy and Thinc versions; releases often track improvements in the underlying transformer architectures.
Warnings
- breaking The main transformer pipe component was renamed to `CuratedTransformer` in `v0.2.0`. If you add the pipe manually, you must update the component name in your configuration.
- breaking Handling of whitespace tokens changed in `v0.3.1`. When accessing `doc._.trf_data[i]` for a whitespace token, the resulting array now has a shape of `(0, n)` (where `n` is the output dimension) instead of a zeroed row. This might affect custom processing logic that assumes a fixed output shape for all tokens.
- breaking Version `2.0.0` was rebased on `curated-transformers` 2.0, bringing significant internal changes and new features (such as discriminative learning rates). Code that interacts directly with the underlying `curated-transformers` objects through `spacy-curated-transformers` may require adjustments.
- deprecated Quantization support was explicitly removed in `v0.2.0` until the serialization API for it could be stabilized. If your workflow relied on this feature, it's no longer available.
- gotcha Dependency management across `spacy`, `thinc`, `curated-transformers`, and `numpy` can be complex. Recent releases (e.g., `v2.1.1`, `v2.1.2`, `v0.3.0`) relax version pins and avoid a direct `spaCy` dependency to improve forward compatibility of trained models. However, you must still ensure that compatible versions are installed to avoid runtime errors (e.g., Thinc 9.1.0 for NumPy v2 compatibility).
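The whitespace-token shape change above is easy to guard against when pooling per-token outputs. A minimal sketch using NumPy; `pooled_token_vector` is a hypothetical helper (not part of the package), and the input is assumed to be one token's piece-level output array:

```python
import numpy as np

def pooled_token_vector(token_array: np.ndarray, dim: int) -> np.ndarray:
    """Mean-pool a token's piece-level outputs into a single vector.

    Since v0.3.1, whitespace tokens can yield an empty (0, n) array
    instead of a zeroed row, so fall back to a zero vector in that case.
    """
    if token_array.shape[0] == 0:  # empty (0, n) array for whitespace tokens
        return np.zeros(dim, dtype=token_array.dtype)
    return token_array.mean(axis=0)

# A regular token covered by two wordpieces, and a whitespace token:
regular = np.ones((2, 4), dtype=np.float32)
whitespace = np.zeros((0, 4), dtype=np.float32)
print(pooled_token_vector(regular, 4))     # mean over the two pieces
print(pooled_token_vector(whitespace, 4))  # zero-vector fallback
```

The fallback keeps downstream code that expects one fixed-size vector per token working unchanged.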
Install
pip install spacy-curated-transformers
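After installing, you can verify that the surrounding stack lines up with the dependency gotcha noted above. A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, not part of the package:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for pkg in packages:
        try:
            report[pkg] = version(pkg)
        except PackageNotFoundError:
            report[pkg] = None
    return report

# Packages whose versions commonly need to agree:
print(installed_versions(["spacy", "thinc", "curated-transformers", "numpy"]))
```

Comparing the printed versions against the package's published requirements is a quick first step when debugging import-time or runtime errors.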
Imports
- CuratedTransformer
from spacy_curated_transformers.pipeline import CuratedTransformer
- DocTransformerOutput (attached to processed docs via the custom extension)
doc._.trf_data
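If you add the pipe via a training config rather than in Python, the component is referenced by its registered factory name. A hedged sketch of the relevant config fragment, assuming the factory is registered as `curated_transformer` (consistent with the `CuratedTransformer` rename above); check your installed version's documentation for the exact factory and architecture names:

```ini
[components.transformer]
factory = "curated_transformer"
```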
Quickstart
import spacy
from spacy.tokens import Doc

# To use spacy-curated-transformers, load a spaCy pipeline that includes
# a transformer component. First, ensure a compatible model is downloaded:
#     python -m spacy download en_core_web_trf
try:
    nlp = spacy.load("en_core_web_trf")
    doc = nlp("Hello, world! This is a test sentence.")
    print(f"Processed doc with {len(doc)} tokens.")

    # Transformer output is exposed via the custom `trf_data` Doc extension.
    # Note: `has_extension` is a classmethod on `Doc`, not on `doc._`.
    if Doc.has_extension("trf_data") and doc._.trf_data is not None:
        trf_data = doc._.trf_data
        # `trf_data` holds the transformer outputs and alignment information;
        # its exact structure depends on the package and version providing
        # the pipe, so inspect it before relying on specific attributes.
        print(f"Transformer data type: {type(trf_data).__name__}")
    else:
        print("No transformer data found. Ensure a transformer pipe is in the pipeline.")
except Exception as e:
    print(f"Error loading or processing model: {e}")
    print("Please ensure 'en_core_web_trf' is downloaded using: python -m spacy download en_core_web_trf")