spaCy Alignments

0.9.2 · active · verified Thu Apr 16

spacy-alignments is a Python library for aligning two different tokenizations of the same text, which is useful when combining NLP tools such as spaCy and transformer models. It provides Python bindings for Yohei Tamura's fast Rust `tokenizations` library. The current version is 0.9.2; releases have mainly added support for new Python versions and tracked updates to the underlying PyO3 bindings.

Common errors

Warnings

Install

Imports

Quickstart

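The `get_alignments` function is the core of the library. Given two token sequences that may have been tokenized or normalized differently, it returns a pair of mappings, `a2b` and `b2a`: `a2b[i]` is the list of indices in the second sequence aligned to token `i` of the first, and `b2a` is the same mapping in the opposite direction.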

import spacy_alignments as tokenizations

# Example from spacy-alignments README/PyPI
tokens_a = ["å", "BC"]
tokens_b = ["abc"]  # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)

# Get alignment mappings for two different tokenizations
a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)

print(f"Alignment from tokens_a to tokens_b: {a2b}")  # [[0], [0]]
print(f"Alignment from tokens_b to tokens_a: {b2a}")  # [[0, 1]]
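
A common next step is to carry per-token annotations (tags, labels) from one tokenization over to the other. The following is a minimal sketch of one way to do that with the index lists `get_alignments` returns. The helper name `project_labels` and the "first aligned token wins" rule are assumptions made for illustration, not part of the library. So that the sketch runs without the package installed, the alignment lists are written out literally, using the values the upstream README reports for the quickstart input.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def project_labels(
    source_labels: Sequence[T],
    target_to_source: Sequence[Sequence[int]],
) -> List[Optional[T]]:
    """Give each target token the label of its first aligned source token.

    `target_to_source` has the shape of an alignment list from
    get_alignments: one list of source indices per target token.
    Target tokens with no aligned source token get None.
    (Hypothetical helper, not provided by spacy-alignments.)
    """
    projected: List[Optional[T]] = []
    for aligned in target_to_source:
        projected.append(source_labels[aligned[0]] if aligned else None)
    return projected


# Alignment lists for tokens_a = ["å", "BC"] and tokens_b = ["abc"],
# written out literally instead of calling get_alignments.
a2b = [[0], [0]]
b2a = [[0, 1]]

# Project a label on the single token in B onto both tokens in A.
labels_on_a = project_labels(["WORD"], a2b)

# Project labels on A onto B; the first aligned token's label is used.
labels_on_b = project_labels(["FIRST", "SECOND"], b2a)

print(labels_on_a)
print(labels_on_b)
```

Taking the first aligned label is only one possible policy. When a target token spans several source tokens with conflicting labels, a majority vote or a task-specific rule may be more appropriate.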
