Curated Tokenizers
Curated Tokenizers is a lightweight Python library from Explosion (the creators of spaCy) that provides fast, production-ready implementations of common piece-tokenization algorithms: Byte-Pair Encoding (BPE), WordPiece, and SentencePiece. It is designed to slot into larger NLP pipelines. The library is currently at version 2.0.0; releases are infrequent but active, with a focus on performance and stability.
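To see what a BPE tokenizer does conceptually, here is a minimal pure-Python sketch of greedy merge application (independent of curated-tokenizers; `bpe_merge` and its inputs are illustrative, not library API):

```python
def bpe_merge(symbols, merges):
    """Repeatedly apply the highest-priority merge rule to adjacent pairs."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(symbols)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no applicable merge rule left
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("a", "b"), ("ab", "c")]
print(bpe_merge("abc", merges))  # ['abc']
print(bpe_merge("abx", merges))  # ['ab', 'x']
```

A trained model is just such a merge table plus a piece-to-ID vocabulary learned from corpus statistics; byte-level BPE runs the same procedure over byte sequences so any input can be represented.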
Warnings
- breaking The package was renamed from `cutlery` to `curated-tokenizers`. Users migrating from `cutlery` must update their import statements and package names.
- gotcha The `SentencePieceProcessor` requires the `sentencepiece` library, an optional dependency that must be installed separately (see Install).
- gotcha All piece processors (ByteBPEProcessor, WordPieceProcessor, SentencePieceProcessor) are designed to load pre-trained models from files. Instantiating them directly in memory for simple demos (as done in the quickstart) is possible but often more complex than loading an existing model file.
- gotcha Version 2.0.0 primarily brings performance improvements to byte BPE encoding. The API is generally stable, but as with any major version bump, test existing code against 2.0.0 for subtle behavioral changes before upgrading in production.
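For the optional `sentencepiece` dependency, a common pattern is to probe the import once and degrade gracefully when it is missing (a generic sketch; the guard and flag name are not part of curated-tokenizers):

```python
# Detect whether the optional sentencepiece dependency is available
# before exercising SentencePiece-based features.
try:
    import sentencepiece  # noqa: F401
    HAS_SENTENCEPIECE = True
except ImportError:
    HAS_SENTENCEPIECE = False

if not HAS_SENTENCEPIECE:
    print("sentencepiece not installed; "
          "run: pip install curated-tokenizers[sentencepiece]")
```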
Install
- pip install curated-tokenizers
- pip install curated-tokenizers[sentencepiece]
Imports
- ByteBPEProcessor
from curated_tokenizers import ByteBPEProcessor
- WordPieceProcessor
from curated_tokenizers import WordPieceProcessor
- SentencePieceProcessor
from curated_tokenizers import SentencePieceProcessor
Quickstart
from curated_tokenizers import ByteBPEProcessor
# Build a tiny in-memory ByteBPE processor for demonstration.
# In a real application, load a pre-trained model from files instead,
# e.g. `ByteBPEProcessor.load_from_files(vocab=..., merges=...)`
# (check the API reference for the exact loader signature).
# Vocabulary mapping pieces to IDs.
vocab = {"a": 1, "b": 2, "c": 3, "ab": 4, "abc": 5}
# Merge rules, in priority order.
merges = [("a", "b"), ("ab", "c")]
# Instantiate the ByteBPEProcessor.
processor = ByteBPEProcessor(vocab, merges)
# The toy vocabulary has no piece covering the space byte, so stick
# to text it can represent.
text = "abcabc"
print(f"Original text: '{text}'")
# Encode the text into piece IDs and the corresponding pieces.
ids, pieces = processor.encode(text)
print(f"Encoded IDs: {ids}")
print(f"Pieces: {pieces}")
# Decode the IDs back into a string.
decoded_text = processor.decode_from_ids(ids)
print(f"Decoded text: '{decoded_text}'")
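Conceptually, decoding reverses encoding: look up each ID's piece and concatenate. (Real byte-level BPE additionally maps pieces back to raw bytes and UTF-8-decodes the result; this sketch skips that step.) A minimal illustration with a hypothetical toy vocabulary mirroring the quickstart:

```python
# Illustrative reverse vocabulary; not library API.
id_to_piece = {1: "a", 2: "b", 3: "c", 4: "ab", 5: "abc"}

def decode(ids):
    """Concatenate the piece string for each ID."""
    return "".join(id_to_piece[i] for i in ids)

print(decode([5, 5]))      # two 'abc' pieces -> 'abcabc'
print(decode([1, 2, 3]))   # 'a' + 'b' + 'c' -> 'abc'
```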