{"id":8562,"library":"pytorch-tokenizers","title":"PyTorch Tokenizers","description":"PyTorch-Tokenizers is a Python package providing efficient C++ implementations for common tokenizers like SentencePiece and TikToken, along with Python bindings. It is primarily designed to serve as a dependency for other PyTorch projects, such as ExecuTorch and torchchat, to facilitate building high-performance LLM runners. The library offers significant efficiency gains for AI workloads, multilingual support, and high decode accuracy. It is actively maintained, with version 1.2.0 aligning its releases with major PyTorch and ExecuTorch updates.","status":"active","version":"1.2.0","language":"en","source_language":"en","source_url":"https://github.com/meta-pytorch/tokenizers","tags":["pytorch","tokenizer","nlp","executorch","sentencepiece","tiktoken","c++"],"install":[{"cmd":"pip install pytorch-tokenizers","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required Python version.","package":"python","version":">=3.10","optional":false}],"imports":[{"note":"This import is for `torchtext`, a separate library with its own SentencePiece tokenizer implementation. `pytorch-tokenizers` provides its own distinct implementation.","wrong":"from torchtext.transforms import SentencePieceTokenizer","symbol":"SentencePieceTokenizer","correct":"from pytorch_tokenizers import SentencePieceTokenizer"},{"note":"This import is for Hugging Face's `transformers` library, which is a different tokenizer ecosystem. `pytorch-tokenizers` is a distinct PyTorch-specific implementation.","wrong":"from transformers import SentencePieceTokenizer","symbol":"SentencePieceTokenizer","correct":"from pytorch_tokenizers import SentencePieceTokenizer"}],"quickstart":{"code":"import os\nimport tempfile\nimport sentencepiece as spm # Required for generating dummy model\nfrom pytorch_tokenizers import SentencePieceTokenizer\n\n# 1. 
Create a dummy SentencePiece model file for demonstration\n#    In real-world scenarios, you would use an existing pre-trained model.\nmodel_prefix = os.path.join(tempfile.gettempdir(), 'm_test')\nmodel_file = f'{model_prefix}.model'\nvocab_file = f'{model_prefix}.vocab'\n\n# Ensure a clean slate for temporary files\nif os.path.exists(model_file): os.remove(model_file)\nif os.path.exists(vocab_file): os.remove(vocab_file)\n\ntext_data = \"Hello world. This is a test sentence. SentencePiece is great!\"\nwith open(f'{model_prefix}.txt', 'w') as f:\n    f.write(text_data)\n\nspm.SentencePieceTrainer.train(\n    input=f'{model_prefix}.txt',\n    model_prefix=model_prefix,\n    vocab_size=32,  # must exceed the corpus's unique character count plus meta symbols\n    model_type='bpe'\n)\n\n# 2. Instantiate the SentencePieceTokenizer from the created model file\ntokenizer = SentencePieceTokenizer.from_file(model_file)\n\n# 3. Encode text\ninput_text = \"This is a sample text for tokenization.\"\nencoded_tokens = tokenizer.encode(input_text)\nprint(f\"Original text: {input_text}\")\nprint(f\"Encoded token IDs: {encoded_tokens}\")\n\n# 4. Decode tokens\ndecoded_text = tokenizer.decode(encoded_tokens)\nprint(f\"Decoded text: {decoded_text}\")\n\n# Clean up temporary files\nos.remove(f'{model_prefix}.txt')\nos.remove(model_file)\nos.remove(vocab_file)","lang":"python","description":"This quickstart demonstrates how to initialize and use the `SentencePieceTokenizer` from `pytorch-tokenizers`. Note that `SentencePieceTokenizer` requires a pre-trained SentencePiece model file (`.model`). For a runnable example, we temporarily generate a dummy model using the `sentencepiece` library (installed separately via `pip install sentencepiece`); the chosen `vocab_size` must be larger than the number of unique characters in the training text, or training will fail. In practical applications, you would typically load an existing model file."},"warnings":[{"fix":"Always check the release notes of `pytorch-tokenizers` and its associated PyTorch/ExecuTorch versions. Upgrade `pytorch-tokenizers` to match your PyTorch/ExecuTorch installation.","message":"PyTorch-Tokenizers maintains tight version alignment with PyTorch and ExecuTorch. 
Major version changes or significant updates in these upstream libraries may introduce incompatibilities or require an update to `pytorch-tokenizers` for continued functionality.","severity":"breaking","affected_versions":"<1.2.0"},{"fix":"Consult the `pytorch-tokenizers` GitHub repository's `README.md` and related ExecuTorch documentation for usage patterns. Do not assume feature parity or identical APIs with other tokenizer libraries.","message":"The `pytorch-tokenizers` library is primarily an internal dependency for PyTorch's on-device AI efforts (like ExecuTorch). As such, it lacks extensive standalone documentation and examples compared to general-purpose tokenization libraries (e.g., Hugging Face's `tokenizers` or `torchtext`). Users expecting a feature-rich, standalone API similar to these other libraries might find the direct usage less intuitive.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure you have access to the necessary model files. These are typically generated by training SentencePiece on a corpus or obtained alongside pre-trained language models from sources like Hugging Face. The quickstart demonstrates generating a dummy file for local testing.","message":"Tokenizers provided by `pytorch-tokenizers`, such as `SentencePieceTokenizer`, require pre-trained model files (e.g., `.model` for SentencePiece) to be instantiated. These model files are not distributed with the `pytorch-tokenizers` Python package itself.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Run `pip install pytorch-tokenizers` to install the package.","cause":"The `pytorch-tokenizers` package is not installed in your current Python environment.","error":"ModuleNotFoundError: No module named 'pytorch_tokenizers'"},{"fix":"Verify that the path to your `.model` file is correct and that the file exists at that location. 
Ensure proper file permissions if the file is present but still inaccessible.","cause":"The `SentencePieceTokenizer.from_file()` method was called with a path to a SentencePiece model file that does not exist or is inaccessible.","error":"FileNotFoundError: No such file or directory: 'your_model.model'"},{"fix":"Check the integrity of your SentencePiece model file. Try regenerating it if you have the original training data or re-downloading it from its source. Ensure it's a valid `.model` file generated by the `sentencepiece` library.","cause":"The SentencePiece model file (`.model`) provided to `SentencePieceTokenizer.from_file()` is either empty, corrupted, or not a valid SentencePiece model.","error":"RuntimeError: The sentencepiece model file is not found or corrupted. (pytorch_tokenizers.cpp)"}]}