{"library":"pytorch-tokenizers","title":"PyTorch Tokenizers","description":"PyTorch-Tokenizers is a Python package providing efficient C++ implementations for common tokenizers like SentencePiece and TikToken, along with Python bindings. It is primarily designed to serve as a dependency for other PyTorch projects, such as ExecuTorch and torchchat, to facilitate building high-performance LLM runners. The library offers significant efficiency gains for AI workloads, multilingual support, and high decode accuracy. It is actively maintained, and its releases (currently 1.2.0) are aligned with major PyTorch and ExecuTorch updates.","language":"python","status":"active","last_verified":"Mon May 18","install":{"commands":["pip install pytorch-tokenizers"],"cli":null},"imports":["from pytorch_tokenizers import SentencePieceTokenizer"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import os\nimport tempfile\nimport sentencepiece as spm # Required for generating dummy model\nfrom pytorch_tokenizers import SentencePieceTokenizer\n\n# 1. Create a dummy SentencePiece model file for demonstration\n#    In real-world scenarios, you would use an existing pre-trained model.\nmodel_prefix = os.path.join(tempfile.gettempdir(), 'm_test')\nmodel_file = f'{model_prefix}.model'\nvocab_file = f'{model_prefix}.vocab'\n\n# Ensure clean slate for temporary files\nif os.path.exists(model_file): os.remove(model_file)\nif os.path.exists(vocab_file): os.remove(vocab_file)\n\ntext_data = \"Hello world. This is a test sentence. SentencePiece is great!\"\nwith open(f'{model_prefix}.txt', 'w') as f:\n    f.write(text_data)\n\nspm.SentencePieceTrainer.train(\n    input=f'{model_prefix}.txt',\n    model_prefix=model_prefix,\n    vocab_size=32,  # must exceed the distinct characters plus special tokens in the corpus\n    model_type='bpe',\n    hard_vocab_limit=False  # treat vocab_size as a soft limit for this tiny corpus\n)\n\n# 2. Instantiate the SentencePieceTokenizer from the created model file\ntokenizer = SentencePieceTokenizer.from_file(model_file)\n\n# 3. 
Encode text\ninput_text = \"This is a sample text for tokenization.\"\nencoded_tokens = tokenizer.encode(input_text)\nprint(f\"Original text: {input_text}\")\nprint(f\"Encoded token IDs: {encoded_tokens}\")\n\n# 4. Decode tokens\ndecoded_text = tokenizer.decode(encoded_tokens)\nprint(f\"Decoded text: {decoded_text}\")\n\n# Clean up temporary files\nos.remove(f'{model_prefix}.txt')\nos.remove(model_file)\nos.remove(vocab_file)","lang":"python","description":"This quickstart demonstrates how to initialize and use the `SentencePieceTokenizer` from `pytorch-tokenizers`. Note that `SentencePieceTokenizer` requires a pre-trained SentencePiece model file (`.model`). For a runnable example, we temporarily generate a dummy model using the `sentencepiece` library. In practical applications, you would typically load an existing model file.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-18","installed_version":"1.2.0","pypi_latest":"1.2.0","is_stale":false,"summary":{"python_range":"3.10–3.13","success_rate":40,"avg_install_s":6.4,"avg_import_s":null,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"pytorch-tokenizers","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"pytorch-tokenizers","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":7.2,"import_time_s":null,"mem_mb":null,"disk_size":"86M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine 
(musl)","variant":"pytorch-tokenizers","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"pytorch-tokenizers","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":6.8,"import_time_s":null,"mem_mb":null,"disk_size":"92M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"pytorch-tokenizers","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"pytorch-tokenizers","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":5.8,"import_time_s":null,"mem_mb":null,"disk_size":"83M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"pytorch-tokenizers","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"pytorch-tokenizers","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":5.8,"import_time_s":null,"mem_mb":null,"disk_size":"83M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"pytorch-tokenizers","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim 
(glibc)","variant":"pytorch-tokenizers","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":null}]}}