RustBPE Tokenizer
RustBPE is a Python library that provides a fast Byte Pair Encoding (BPE) tokenizer implemented in Rust, with Python bindings. It is designed primarily for training GPT-style BPE tokenizers and offers features like parallel processing, GPT-4 style regex pre-tokenization, and direct export to the tiktoken format for efficient inference. It is currently at version 0.1.0, an initial release, so the API may still evolve quickly.
Common errors
- ModuleNotFoundError: No module named 'rustbpe'
  cause: The `rustbpe` package is not installed in the current Python environment.
  fix: Run `pip install rustbpe` to install the package.
- TypeError: train_from_iterator() missing 1 required positional argument: 'vocab_size'
  cause: The `train_from_iterator` method requires `vocab_size` to be explicitly provided; it determines the final size of the tokenizer's vocabulary.
  fix: Pass `vocab_size` as a keyword argument, e.g., `tokenizer.train_from_iterator(data_iterator, vocab_size=32768)`.
- error: can't find Rust compiler
  cause: This occurs when installing `rustbpe` from source (e.g., when no pre-compiled wheel is available for your system or Python version) without a Rust toolchain installed.
  fix: Install Rust by following the instructions at `rustup.rs` (e.g., `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`), then retry `pip install rustbpe`. Ensure `maturin` is also installed if building directly from the git repository.
Warnings
- gotcha RustBPE is optimized for training BPE tokenizers and exporting them to the tiktoken format for inference. It does offer encoding/decoding, but its primary value is the training side, which distinguishes it from libraries focused solely on inference.
- gotcha The library is in its initial `0.1.0` release. While tested, the API may be subject to changes and refinements in subsequent minor versions as the project matures.
- gotcha RustBPE defaults to GPT-4 style regex pre-tokenization. If you need to match tokenization behavior of older GPT models (e.g., GPT-2/3) or other tokenizer types, this default pattern might yield different token splits.
- gotcha The author notes that while the Python reference code is expert-level and equality tests pass, the underlying Rust implementation had significant AI assistance and might not be optimally structured by a Rust expert.
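To make the pre-tokenization gotcha concrete, here is a simplified, ASCII-only illustration of one key difference between GPT-2-style and GPT-4-style split patterns. These are not the actual patterns (the real ones use Unicode classes like `\p{L}`/`\p{N}` and require the third-party `regex` module); they only demonstrate that GPT-4-style splitting caps digit runs at three characters, while GPT-2-style splitting keeps them whole.

```python
import re

# Simplified ASCII stand-ins for the real pre-tokenization patterns.
# The only difference shown: GPT-2 style keeps a full digit run together,
# GPT-4 style splits numbers into chunks of at most 3 digits.
GPT2_STYLE = r" ?[a-zA-Z]+| ?[0-9]+| ?[^\sa-zA-Z0-9]+|\s+"
GPT4_STYLE = r" ?[a-zA-Z]+|[0-9]{1,3}| ?[^\sa-zA-Z0-9]+|\s+"

text = "price is 123456 dollars"

gpt2_chunks = re.findall(GPT2_STYLE, text)
gpt4_chunks = re.findall(GPT4_STYLE, text)

print(gpt2_chunks)  # ['price', ' is', ' 123456', ' dollars']
print(gpt4_chunks)  # ['price', ' is', ' ', '123', '456', ' dollars']
```

Because the pre-tokenizer runs before any BPE merges, different split patterns produce different merge candidates and therefore different final token ids, even with identical training data.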
Install
pip install rustbpe
Imports
- Tokenizer
  import rustbpe
  tokenizer = rustbpe.Tokenizer()
Quickstart
import rustbpe
import os
# Create a tokenizer instance
tokenizer = rustbpe.Tokenizer()
# Prepare some sample training data
training_texts = [
"hello world",
"this is a test sentence",
"rustbpe is fast and efficient"
]
# Train the tokenizer
# vocab_size sets the final vocabulary size. Byte-level BPE starts from the
# 256 raw byte tokens, so only values above 256 cause any merges to be learned.
tokenizer.train_from_iterator(training_texts, vocab_size=300)  # small vocab for the example
# Encode text
text_to_encode = "hello rustbpe, how are you today?"
ids = tokenizer.encode(text_to_encode)
print(f"Encoded IDs: {ids}")
# Decode IDs back to text
decoded_text = tokenizer.decode(ids)
print(f"Decoded Text: {decoded_text}")
# Batch encode multiple texts (uses parallelization)
batch_texts = ["text one", "text two", "text three"]
all_ids = tokenizer.batch_encode(batch_texts)
print(f"Batch Encoded IDs: {all_ids}")
# Optional: Export to tiktoken format (requires tiktoken to be installed)
# if os.environ.get('ENABLE_TIKTOKEN_EXPORT', 'false').lower() == 'true':
# import tiktoken
# tiktoken_tokenizer = tokenizer.export_to_tiktoken()
# print("Tokenizer exported to tiktoken format.")
print(f"Vocabulary size: {tokenizer.vocab_size}")
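Conceptually, what `train_from_iterator` performs is classic BPE training: repeatedly find the most frequent adjacent token pair and merge it into a new token. Below is a minimal pure-Python sketch of that loop, for intuition only; it is not RustBPE's actual implementation and omits the regex pre-tokenization and parallelism entirely.

```python
from collections import Counter

def train_bpe(texts, vocab_size):
    """Toy byte-level BPE trainer: repeatedly merge the most frequent
    adjacent token pair until vocab_size is reached."""
    # Start from the 256 byte tokens; each text becomes a list of byte ids.
    sequences = [list(t.encode("utf-8")) for t in texts]
    merges = {}  # (left_id, right_id) -> new token id
    next_id = 256
    while next_id < vocab_size:
        # Count all adjacent pairs across all sequences.
        pairs = Counter()
        for seq in sequences:
            for pair in zip(seq, seq[1:]):
                pairs[pair] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:
            break  # no pair repeats; further merges are pointless
        merges[best] = next_id
        # Replace every occurrence of the best pair with the new token.
        new_sequences = []
        for seq in sequences:
            merged, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    merged.append(next_id)
                    i += 2
                else:
                    merged.append(seq[i])
                    i += 1
            new_sequences.append(merged)
        sequences = new_sequences
        next_id += 1
    return merges

merges = train_bpe(["hello world", "hello there"], vocab_size=260)
print(len(merges))  # → 4 (the shared "hello " prefix yields repeated pairs)
```

The first merge here is the pair `('h', 'e')` (byte ids 104 and 101), since "he" appears three times across the two texts. A real trainer like RustBPE does this same pair-counting and merging, but on pre-tokenized chunks and with far more efficient data structures.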