RustBPE Tokenizer

0.1.0 · active · verified Thu Apr 16

RustBPE is a Python library providing a fast Byte Pair Encoding (BPE) tokenizer implemented in Rust with Python bindings. It is designed primarily for training GPT-style BPE tokenizers and offers parallel processing, GPT-4 style regex pre-tokenization, and direct export to the tiktoken format for efficient inference. At version 0.1.0 it is an initial release, so the API may still change as development continues.
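
To make the training step concrete, here is a minimal, illustrative sketch of what byte-level BPE training does, in pure Python. This is not RustBPE's internal implementation (which is written in Rust and parallelized); it only shows the core algorithm: repeatedly merge the most frequent adjacent token pair into a new token until the vocabulary is full.

```python
# Illustrative byte-level BPE training loop (not rustbpe's internals).
from collections import Counter

def train_bpe(texts, vocab_size):
    """Learn BPE merge rules over UTF-8 bytes until vocab_size is reached."""
    # Start from the 256 single-byte tokens.
    sequences = [list(text.encode("utf-8")) for text in texts]
    merges = {}  # (left_id, right_id) -> new token id
    next_id = 256
    while next_id < vocab_size:
        # Count adjacent pairs across all training sequences.
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges[best] = next_id
        # Replace every occurrence of the pair with the new token.
        new_sequences = []
        for seq in sequences:
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    merged.append(next_id)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_sequences.append(merged)
        sequences = new_sequences
        next_id += 1
    return merges

merges = train_bpe(["low low lower", "lowest low"], vocab_size=260)
print(len(merges))  # 4 merge rules: vocab_size 260 minus the 256 base bytes
```

Real implementations avoid the full rescan on every merge with incremental pair counts; the quadratic version above is only meant to show the algorithm.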

Common errors

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to initialize `rustbpe.Tokenizer`, train it on a small corpus, and use it to encode and decode text, including batch operations. It also shows optional export to the tiktoken format.

import rustbpe
import os

# Create a tokenizer instance
tokenizer = rustbpe.Tokenizer()

# Prepare some sample training data
training_texts = [
    "hello world",
    "this is a test sentence",
    "rustbpe is fast and efficient"
]

# Train the tokenizer
# vocab_size is a crucial parameter: it caps the output vocabulary size.
# Byte-level BPE starts from 256 single-byte tokens, so choose a value
# above 256 to leave room for learned merges.
tokenizer.train_from_iterator(training_texts, vocab_size=300) # Small vocab for example

# Encode text
text_to_encode = "hello rustbpe, how are you today?"
ids = tokenizer.encode(text_to_encode)
print(f"Encoded IDs: {ids}")

# Decode IDs back to text
decoded_text = tokenizer.decode(ids)
print(f"Decoded Text: {decoded_text}")

# Batch encode multiple texts (uses parallelization)
batch_texts = ["text one", "text two", "text three"]
all_ids = tokenizer.batch_encode(batch_texts)
print(f"Batch Encoded IDs: {all_ids}")

# Optional: Export to tiktoken format (requires tiktoken to be installed)
# if os.environ.get('ENABLE_TIKTOKEN_EXPORT', 'false').lower() == 'true':
#     import tiktoken
#     tiktoken_tokenizer = tokenizer.export_to_tiktoken()
#     print("Tokenizer exported to tiktoken format.")

print(f"Vocabulary size: {tokenizer.vocab_size}")

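The GPT-4 style regex pre-tokenization mentioned above splits raw text into word, number, and punctuation chunks before any BPE merging, so merges never cross those boundaries. A rough, ASCII-only sketch of that splitting with the stdlib `re` module follows; the actual GPT-4 pattern is more elaborate (Unicode character properties and possessive quantifiers, which require the third-party `regex` package), so this is an approximation for illustration only.

```python
import re

# Simplified ASCII-only approximation of a GPT-4 style split pattern.
PATTERN = re.compile(
    r"'(?i:[sdmt]|ll|ve|re)"          # common English contractions ('s, 'll, ...)
    r"|[^\r\na-zA-Z0-9]?[a-zA-Z]+"    # a word, optionally led by one non-word char
    r"|[0-9]{1,3}"                    # digits in groups of at most three
    r"| ?[^\sa-zA-Z0-9]+"             # punctuation run, optional leading space
    r"|\s+"                           # remaining whitespace
)

chunks = PATTERN.findall("Hello world, it's 2024!")
print(chunks)
# ['Hello', ' world', ',', ' it', "'s", ' ', '202', '4', '!']
```

Note how the leading space sticks to the following word (`' world'`) and numbers are capped at three digits, both properties of the GPT-4 pattern that keep the learned vocabulary well behaved.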