Curated Tokenizers

2.0.0 · active · verified Mon Apr 13

Curated Tokenizers is a lightweight Python library by Explosion (the creators of spaCy) that provides fast, production-ready implementations of common subword ("piece") tokenization schemes, including Byte-Pair Encoding (BPE), WordPiece, and SentencePiece. It focuses on reliable, efficient tokenization that slots cleanly into larger NLP pipelines. The library is currently at version 2.0.0, with an active but infrequent release cadence focused on performance and stability.
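To make the BPE idea concrete, here is a minimal pure-Python sketch of the merge loop: repeatedly apply learned merge rules, in priority order, to a list of symbols. This illustrates the algorithm only; it is not the library's implementation, and `bpe_encode` is a hypothetical helper written for this example.

```python
def bpe_encode(word, merges):
    """Apply BPE merge rules to the characters of `word` until none apply.

    `merges` is an ordered list of symbol pairs; a lower index means the
    merge was learned earlier and therefore has higher priority.
    """
    symbols = list(word)
    ranks = {pair: i for i, pair in enumerate(merges)}
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(pairs)
        if rank == float("inf"):
            break  # no applicable merge remains
        # Merge the winning pair into a single symbol.
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = [("a", "b"), ("ab", "c")]
print(bpe_encode("abc", merges))   # -> ['abc']
print(bpe_encode("acb", merges))   # -> ['a', 'c', 'b']
```

Note how "acb" is left as single characters: neither merge rule matches an adjacent pair, which is exactly how BPE falls back to smaller pieces for unseen sequences.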

Warnings

Install

pip install curated-tokenizers

Imports

from curated_tokenizers import ByteBPEProcessor

Quickstart

This quickstart shows how to instantiate and use a `ByteBPEProcessor` to encode and decode text. The example builds a toy processor in memory; in typical usage you load a pre-trained model from files (for byte-level BPE, a `vocab.json` and `merges.txt`) via the processor's file-loading class methods. Exact constructor and method signatures can vary between releases, so treat the snippet below as a sketch and consult the API documentation for your installed version.

from curated_tokenizers import ByteBPEProcessor

# Create a minimal, in-memory ByteBPE processor for demonstration.
# In a real application, you would load pre-trained vocabulary and merge
# files instead of constructing these by hand.

# A toy vocabulary mapping pieces to integer IDs.
vocab = {"a": 1, "b": 2, "c": 3, "ab": 4, "abc": 5}

# Merge rules, applied in order: "a" + "b" -> "ab", then "ab" + "c" -> "abc".
merges = [("a", "b"), ("ab", "c")]

# Instantiate the processor from the vocabulary and merge list.
processor = ByteBPEProcessor(vocab, merges)

text = "abc"
print(f"Original text: '{text}'")

# Encode the text; the processor returns the piece IDs and the pieces.
ids, pieces = processor.encode(text)
print(f"Encoded IDs: {ids}")
print(f"Pieces: {pieces}")

# Decode the IDs back into a string.
decoded_text = processor.decode_from_ids(ids)
print(f"Decoded text: '{decoded_text}'")
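Conceptually, once the merge rules have produced pieces, encoding is a vocabulary lookup and decoding is the reverse lookup followed by concatenation. A pure-Python illustration of that step, using a toy vocabulary like the one in the quickstart (this is not the library's implementation):

```python
# Toy piece-to-ID vocabulary, assumed for this illustration.
vocab = {"a": 1, "b": 2, "c": 3, "ab": 4, "abc": 5}
# Build the reverse mapping for decoding.
id_to_piece = {v: k for k, v in vocab.items()}

# Suppose the BPE merge step reduced "abc" to a single piece.
pieces = ["abc"]
ids = [vocab[p] for p in pieces]                 # -> [5]
decoded = "".join(id_to_piece[i] for i in ids)   # -> "abc"
print(ids, decoded)
```

The round trip recovers the original string because every piece in the output is present in the vocabulary; real processors additionally handle byte-level fallback for pieces that are not.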
