PyTorch Tokenizers
PyTorch-Tokenizers is a Python package that provides efficient C++ implementations of common tokenizers, such as SentencePiece and TikToken, along with Python bindings. It is primarily designed as a dependency for other PyTorch projects, such as ExecuTorch and torchchat, to support building high-performance LLM runners. The C++ implementations target efficient AI workloads, multilingual text, and accurate round-trip decoding. The library is actively maintained; as of version 1.2.0, its releases are aligned with major PyTorch and ExecuTorch updates.
Common errors
- ModuleNotFoundError: No module named 'pytorch_tokenizers'
  Cause: The `pytorch-tokenizers` package is not installed in the current Python environment.
  Fix: Run `pip install pytorch-tokenizers` to install the package.
- FileNotFoundError: No such file or directory: 'your_model.model'
  Cause: `SentencePieceTokenizer.from_file()` was called with a path to a SentencePiece model file that does not exist or is inaccessible.
  Fix: Verify that the path to the `.model` file is correct and that the file exists at that location. If the file is present but still inaccessible, check its permissions.
- RuntimeError: The sentencepiece model file is not found or corrupted. (pytorch_tokenizers.cpp)
  Cause: The SentencePiece model file (`.model`) passed to `SentencePieceTokenizer.from_file()` is empty, corrupted, or not a valid SentencePiece model.
  Fix: Check the integrity of the model file. Regenerate it from the original training data or re-download it from its source, and make sure it is a valid `.model` file produced by the `sentencepiece` library.
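The two model-file errors above can often be caught early with a small pre-flight check before constructing the tokenizer. A minimal sketch using only the standard library; the helper name `check_model_file` is illustrative, not part of the `pytorch-tokenizers` API:

```python
import os

def check_model_file(path):
    """Fail fast, with a clear message, if a SentencePiece model file
    is missing or obviously unusable (zero bytes)."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"SentencePiece model not found: {path}")
    if os.path.getsize(path) == 0:
        raise RuntimeError(f"SentencePiece model is empty: {path}")
    return path
```

A zero-byte file is only the most obvious form of corruption; a full validity check still requires actually loading the model.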
Warnings
- breaking PyTorch-Tokenizers maintains tight version alignment with PyTorch and ExecuTorch. Major version changes or significant updates in these upstream libraries may introduce incompatibilities or require an update to `pytorch-tokenizers` for continued functionality.
- gotcha The `pytorch-tokenizers` library is primarily an internal dependency for PyTorch's on-device AI efforts (such as ExecuTorch). As a result, it has little standalone documentation and few examples compared to general-purpose tokenization libraries (e.g., Hugging Face's `tokenizers` or `torchtext`), and users expecting a similarly feature-rich standalone API may find direct usage less intuitive.
- gotcha Tokenizers provided by `pytorch-tokenizers`, such as `SentencePieceTokenizer`, require pre-trained model files (e.g., `.model` for SentencePiece) to be instantiated. These model files are not distributed with the `pytorch-tokenizers` Python package itself.
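Given the tight version coupling noted above, it can help to log at startup which versions are actually installed. A minimal standard-library sketch; the distribution names listed are assumptions about what your environment uses, so adjust as needed:

```python
from importlib import metadata

def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

# Example: report the versions that matter for compatibility.
for dist in ("pytorch-tokenizers", "torch", "executorch"):
    print(dist, installed_version(dist))
```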
Install
pip install pytorch-tokenizers
Imports
- SentencePieceTokenizer
from pytorch_tokenizers import SentencePieceTokenizer
Note: `torchtext.transforms` also exposes a class named `SentencePieceTokenizer`; it is unrelated to this package, so make sure the import comes from `pytorch_tokenizers`.
Quickstart
import os
import tempfile
import sentencepiece as spm # Required for generating dummy model
from pytorch_tokenizers import SentencePieceTokenizer
# 1. Create a dummy SentencePiece model file for demonstration
# In real-world scenarios, you would use an existing pre-trained model.
model_prefix = os.path.join(tempfile.gettempdir(), 'm_test')
model_file = f'{model_prefix}.model'
vocab_file = f'{model_prefix}.vocab'
# Ensure clean slate for temporary files
if os.path.exists(model_file): os.remove(model_file)
if os.path.exists(vocab_file): os.remove(vocab_file)
text_data = "Hello world. This is a test sentence. SentencePiece is great!"
with open(f'{model_prefix}.txt', 'w') as f:
    f.write(text_data)
spm.SentencePieceTrainer.train(
    input=f'{model_prefix}.txt',
    model_prefix=model_prefix,
    vocab_size=30,  # must be at least the number of unique characters in the corpus plus special tokens
    model_type='bpe'
)
# 2. Instantiate the SentencePieceTokenizer from the created model file
tokenizer = SentencePieceTokenizer.from_file(model_file)
# 3. Encode text
input_text = "This is a sample text for tokenization."
encoded_tokens = tokenizer.encode(input_text)
print(f"Original text: {input_text}")
print(f"Encoded token IDs: {encoded_tokens}")
# 4. Decode tokens
decoded_text = tokenizer.decode(encoded_tokens)
print(f"Decoded text: {decoded_text}")
# Clean up temporary files
os.remove(f'{model_prefix}.txt')
os.remove(model_file)
os.remove(vocab_file)
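The manual cleanup above leaves temporary files behind if any earlier step raises. Wrapping the whole Quickstart in `tempfile.TemporaryDirectory` removes everything automatically; a sketch, with the training and tokenizer steps elided:

```python
import os
import tempfile

# All generated files live inside a managed temp dir that is deleted on
# exit from the with-block, even if an intermediate step raises.
with tempfile.TemporaryDirectory() as tmpdir:
    model_prefix = os.path.join(tmpdir, 'm_test')
    # ... train the SentencePiece model and load the tokenizer here,
    # exactly as in the Quickstart above ...
# At this point tmpdir and everything inside it are gone.
```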