PyTorch Tokenizers

1.2.0 · active · verified Thu Apr 16

PyTorch-Tokenizers is a Python package providing efficient C++ implementations of common tokenizers, such as SentencePiece and Tiktoken, along with Python bindings. It is primarily designed as a dependency for other PyTorch projects, such as ExecuTorch and torchchat, to support building high-performance LLM runners. The library offers significant efficiency gains for AI workloads, multilingual support, and high decode accuracy. It is actively maintained, with version 1.2.0 aligning its releases with major PyTorch and ExecuTorch updates.

Install
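The PyPI distribution name below is an assumption inferred from the package name above; verify it against the project's own install instructions. The quickstart additionally trains a throwaway model with the separate `sentencepiece` library:

```shell
# Assumed PyPI name, inferred from the package description above.
pip install pytorch-tokenizers

# Only needed to train the dummy demo model in the quickstart.
pip install sentencepiece
```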

Imports
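The quickstart below relies on these imports; `sentencepiece` is used only to train the throwaway demo model, not by the tokenizer itself:

```python
import os
import tempfile

import sentencepiece as spm  # only for generating the dummy demo model
from pytorch_tokenizers import SentencePieceTokenizer
```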

Quickstart

This quickstart demonstrates how to initialize and use the `SentencePieceTokenizer` from `pytorch-tokenizers`. Note that `SentencePieceTokenizer` requires a pre-trained SentencePiece model file (`.model`). For a runnable example, we temporarily generate a dummy model using the `sentencepiece` library. In practical applications, you would typically load an existing model file.

import os
import tempfile
import sentencepiece as spm # Required for generating dummy model
from pytorch_tokenizers import SentencePieceTokenizer

# 1. Create a dummy SentencePiece model file for demonstration
#    In real-world scenarios, you would use an existing pre-trained model.
model_prefix = os.path.join(tempfile.gettempdir(), 'm_test')
model_file = f'{model_prefix}.model'
vocab_file = f'{model_prefix}.vocab'

# Ensure clean slate for temporary files
if os.path.exists(model_file): os.remove(model_file)
if os.path.exists(vocab_file): os.remove(vocab_file)

text_data = "Hello world. This is a test sentence. SentencePiece is great!"
with open(f'{model_prefix}.txt', 'w') as f:
    f.write(text_data)

spm.SentencePieceTrainer.train(
    input=f'{model_prefix}.txt',
    model_prefix=model_prefix,
    # vocab_size must cover every distinct character in the corpus plus the
    # three control tokens (<unk>, <s>, </s>); 10 is too small for this text
    # and would make training fail with "Vocabulary size is smaller than
    # required_chars".
    vocab_size=32,
    model_type='bpe'
)

# 2. Instantiate the SentencePieceTokenizer from the created model file
tokenizer = SentencePieceTokenizer.from_file(model_file)

# 3. Encode text
input_text = "This is a sample text for tokenization."
encoded_tokens = tokenizer.encode(input_text)
print(f"Original text: {input_text}")
print(f"Encoded token IDs: {encoded_tokens}")

# 4. Decode tokens
decoded_text = tokenizer.decode(encoded_tokens)
print(f"Decoded text: {decoded_text}")

# Clean up temporary files
os.remove(f'{model_prefix}.txt')
os.remove(model_file)
os.remove(vocab_file)
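As a stdlib-only alternative to the manual `os.remove()` cleanup above, Python's `tempfile.TemporaryDirectory` deletes the directory and everything inside it when the `with` block exits:

```python
import os
import tempfile

# TemporaryDirectory removes itself (and every file inside) when the
# `with` block exits, so no explicit os.remove() calls are needed.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_prefix = os.path.join(tmp_dir, "m_test")
    corpus_path = f"{model_prefix}.txt"
    with open(corpus_path, "w") as f:
        f.write("Hello world. This is a test sentence.")
    # ... train the SentencePiece model with model_prefix here ...
    assert os.path.exists(corpus_path)

# The corpus file (and the whole directory) are gone after the block.
assert not os.path.exists(corpus_path)
```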
