{"library":"rustbpe","title":"RustBPE Tokenizer","description":"RustBPE is a Python library that provides a fast Byte Pair Encoding (BPE) tokenizer implemented in Rust, with Python bindings. It is designed primarily for training GPT-style BPE tokenizers and offers parallel processing, GPT-4 style regex pre-tokenization, and direct export to the tiktoken format for efficient inference. The current version, 0.1.0, is the initial release, so the API may still change.","language":"python","status":"active","last_verified":"2026-05-12","install":{"commands":["pip install rustbpe"],"cli":null},"imports":["import rustbpe\ntokenizer = rustbpe.Tokenizer()"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import rustbpe\nimport os\n\n# Create a tokenizer instance\ntokenizer = rustbpe.Tokenizer()\n\n# Prepare some sample training data\ntraining_texts = [\n    \"hello world\",\n    \"this is a test sentence\",\n    \"rustbpe is fast and efficient\"\n]\n\n# Train the tokenizer\n# vocab_size is a crucial parameter defining the output vocabulary size\ntokenizer.train_from_iterator(training_texts, vocab_size=256) # Small vocab for example\n\n# Encode text\ntext_to_encode = \"hello rustbpe, how are you today?\"\nids = tokenizer.encode(text_to_encode)\nprint(f\"Encoded IDs: {ids}\")\n\n# Decode IDs back to text\ndecoded_text = tokenizer.decode(ids)\nprint(f\"Decoded Text: {decoded_text}\")\n\n# Batch encode multiple texts (uses parallelization)\nbatch_texts = [\"text one\", \"text two\", \"text three\"]\nall_ids = tokenizer.batch_encode(batch_texts)\nprint(f\"Batch Encoded IDs: {all_ids}\")\n\n# Optional: Export to tiktoken format (requires tiktoken to be installed)\n# if os.environ.get('ENABLE_TIKTOKEN_EXPORT', 'false').lower() == 'true':\n#     import tiktoken\n#     tiktoken_tokenizer = tokenizer.export_to_tiktoken()\n#     print(\"Tokenizer exported to tiktoken format.\")\n\nprint(f\"Vocabulary size: 
{tokenizer.vocab_size}\")\n","lang":"python","description":"This quickstart demonstrates how to initialize the `rustbpe.Tokenizer`, train it on a small dataset, and then use it to encode and decode text, including batch operations. It also includes a commented-out, optional export to the tiktoken format.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-12","installed_version":null,"pypi_latest":"0.1.0","is_stale":null,"summary":{"python_range":"3.9–3.13","success_rate":50,"avg_install_s":1.9,"avg_import_s":0.02,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"rustbpe","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"rustbpe","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":1.8,"import_time_s":0.01,"mem_mb":0.7,"disk_size":"21M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"rustbpe","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":0.1,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"rustbpe","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":1.9,"import_time_s":0.03,"mem_mb":1.1,"disk_size":"23M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine 
(musl)","variant":"rustbpe","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":0.1,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"rustbpe","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":1.8,"import_time_s":0.03,"mem_mb":0.9,"disk_size":"15M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"rustbpe","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"rustbpe","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":1.9,"import_time_s":0.03,"mem_mb":1,"disk_size":"14M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"rustbpe","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"rustbpe","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":2,"import_time_s":0.01,"mem_mb":0.7,"disk_size":"20M"}]}}