spaCy Alignments
spacy-alignments is a Python library that provides efficient tokenization alignment capabilities, particularly useful for integrating different NLP tools like spaCy and transformer models. It offers Python bindings for Yohei Tamura's highly performant Rust `tokenizations` library. The current version is 0.9.2, with releases primarily focused on supporting new Python versions and underlying PyO3 updates.
Common errors
-
UserWarning: [W030] Some entities could not be aligned in the text "..." with entities "[...]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.cause Character-based entity offsets in training data do not perfectly map to the token boundaries generated by spaCy's tokenizer, which is a common issue when combining different tokenization strategies or handling noisy text.fixInspect your training data's character offsets and original text carefully. Use the suggested `spacy.training.offsets_to_biluo_tags` (for spaCy v3+) or `spacy.gold.biluo_tags_from_offsets` (for spaCy v2.3+) to debug specific misalignments. Adjust entity start/end indices to precisely match token boundaries, or modify spaCy's tokenizer exceptions for better alignment. -
ModuleNotFoundError: No module named 'spacy_alignments'
cause The `spacy-alignments` package is not installed or not accessible in the current Python environment.fixInstall the package using `pip install spacy-alignments`. If using virtual environments, ensure the correct environment is activated. -
error: can't find crate `tokenizations`
cause This error, or similar Rust compilation errors (e.g., `command 'rustc' not found`), indicates that a pre-built binary wheel for `spacy-alignments` was not available for your system, and the build process failed because the Rust toolchain is missing or misconfigured.fixInstall Rust by following the instructions at `https://rustup.rs/`. Ensure that the `cargo` command is available in your system's PATH. You may also try upgrading `pip` and `setuptools` (`pip install -U pip setuptools`) before retrying the installation.
Warnings
- breaking Version 0.9.0 dropped support for Python 3.6.
- gotcha Installation may require Rust compiler if pre-built binary wheels are not available for your platform and Python version.
- gotcha When using `spacy-transformers` v1.2 or newer, the alignment between spaCy tokens and transformer tokens for 'fast tokenizers' may differ from previous versions. This is because `spacy-transformers` now uses exact alignment from the underlying tokenizers instead of `spacy-alignments`'s heuristic method.
Install
-
pip install spacy-alignments
Imports
- get_alignments
import spacy_alignments as tokenizations a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
Quickstart
import spacy_alignments as tokenizations
# Example from spacy-alignments README/PyPI
tokens_a = ["å", "BC"]
tokens_b = ["abc"] # the accent is dropped (å -> a) and the letters are lowercased(BC -> bc)
# Get alignment mappings for two different tokenizations
a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
print(f"Alignment from tokens_a to tokens_b: {a2b}")
print(f"Alignment from tokens_b to tokens_a: {b2a}")