{"id":7695,"library":"rustbpe","title":"RustBPE Tokenizer","description":"RustBPE is a fast Byte Pair Encoding (BPE) tokenizer implemented in Rust and exposed through Python bindings. It is designed primarily for training GPT-style BPE tokenizers and offers parallel processing, GPT-4 style regex pre-tokenization, and export to the tiktoken format for efficient inference. Version 0.1.0 is the initial release, so the API may still change.","status":"active","version":"0.1.0","language":"en","source_language":"en","source_url":"https://github.com/karpathy/rustbpe","tags":["BPE","tokenizer","NLP","Rust","Python bindings","LLM","tiktoken","training"],"install":[{"cmd":"pip install rustbpe","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Commonly used for inference after training a tokenizer with rustbpe, as rustbpe can export models to tiktoken's format.","package":"tiktoken","optional":true}],"imports":[{"note":"The primary class for BPE operations is directly available under the 'rustbpe' module.","symbol":"Tokenizer","correct":"import rustbpe\ntokenizer = rustbpe.Tokenizer()"}],"quickstart":{"code":"import rustbpe\nimport os\n\n# Create a tokenizer instance\ntokenizer = rustbpe.Tokenizer()\n\n# Prepare some sample training data\ntraining_texts = [\n    \"hello world\",\n    \"this is a test sentence\",\n    \"rustbpe is fast and efficient\"\n]\n\n# Train the tokenizer\n# vocab_size sets the final vocabulary size: 256 byte-level tokens plus learned merges.\n# 256 keeps the example tiny but leaves no room for merges; use a larger value in practice.\ntokenizer.train_from_iterator(training_texts, vocab_size=256)\n\n# Encode text\ntext_to_encode = \"hello rustbpe, how are you today?\"\nids = tokenizer.encode(text_to_encode)\nprint(f\"Encoded IDs: {ids}\")\n\n# Decode IDs back to text\ndecoded_text = tokenizer.decode(ids)\nprint(f\"Decoded Text: {decoded_text}\")\n\n# Batch encode multiple texts (uses parallelization)\nbatch_texts = [\"text one\", 
\"text two\", \"text three\"]\nall_ids = tokenizer.batch_encode(batch_texts)\nprint(f\"Batch Encoded IDs: {all_ids}\")\n\n# Optional: export to tiktoken for inference (requires tiktoken to be installed).\n# rustbpe exposes the regex pattern and merge ranks; tiktoken.Encoding consumes them.\n# if os.environ.get('ENABLE_TIKTOKEN_EXPORT', 'false').lower() == 'true':\n#     import tiktoken\n#     enc = tiktoken.Encoding(\n#         name=\"my_tokenizer\",\n#         pat_str=tokenizer.get_pattern(),\n#         mergeable_ranks={bytes(k): v for k, v in tokenizer.get_mergeable_ranks()},\n#         special_tokens={},\n#     )\n#     print(\"Tokenizer exported to tiktoken format.\")\n\nprint(f\"Vocabulary size: {tokenizer.vocab_size}\")\n","lang":"python","description":"This quickstart demonstrates how to initialize the `rustbpe.Tokenizer`, train it on a small dataset, and then use it to encode and decode text, including batch operations. It also shows the (optional) export to the tiktoken format."},"warnings":[{"fix":"Treat rustbpe as the 'missing tiktoken training code': train with rustbpe, then export the result and consider tiktoken for production inference.","message":"RustBPE is optimized for training BPE tokenizers and exporting them to the tiktoken format for inference. It does offer encoding and decoding, but its main value lies in training, which distinguishes it from libraries focused solely on inference.","severity":"gotcha","affected_versions":"0.1.0+"},{"fix":"Pin your dependency to a specific minor version (`rustbpe==0.1.*`) in production environments and review release notes for breaking changes when updating.","message":"The library is in its initial `0.1.0` release. While tested, the API may change in subsequent minor versions as the project matures.","severity":"gotcha","affected_versions":"0.1.0"},{"fix":"Review the tokenizer's pre-tokenization regex if exact compatibility with non-GPT-4 style models is critical. The library's source code indicates the specific pattern used.","message":"RustBPE defaults to GPT-4 style regex pre-tokenization. 
If you need to match the tokenization behavior of older GPT models (e.g., GPT-2/3) or other tokenizer types, this default pattern may produce different token splits.","severity":"gotcha","affected_versions":"0.1.0+"},{"fix":"The library is functional, but the Rust code may later be restructured for performance or maintainability. Report any inefficiencies or bugs you observe on the GitHub issues page.","message":"The author notes that while the Python reference code is expert-level and equality tests pass, the underlying Rust implementation was written with significant AI assistance and may not be structured the way a Rust expert would structure it.","severity":"gotcha","affected_versions":"0.1.0+"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Run `pip install rustbpe` to install the package.","cause":"The `rustbpe` package is not installed in the current Python environment.","error":"ModuleNotFoundError: No module named 'rustbpe'"},{"fix":"Pass `vocab_size` explicitly, e.g., `tokenizer.train_from_iterator(data_iterator, vocab_size=32768)`.","cause":"The `train_from_iterator` method requires `vocab_size` to be explicitly provided; it determines the final size of the tokenizer's vocabulary.","error":"TypeError: train_from_iterator() missing 1 required positional argument: 'vocab_size'"},{"fix":"Install Rust by following the instructions on `rustup.rs` (e.g., `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`) and then retry `pip install rustbpe`. Ensure `maturin` is also installed if building directly from the git repository.","cause":"This error occurs when `rustbpe` is installed from source (e.g., because no pre-compiled wheel is available for your system or Python version) without a Rust toolchain installed.","error":"error: can't find Rust compiler"}]}