SentencePiece
SentencePiece is an unsupervised text tokenizer and detokenizer, designed primarily for neural text generation systems where the vocabulary size is predetermined. It implements subword algorithms such as Byte-Pair Encoding (BPE) and the Unigram Language Model, and can train directly from raw sentences without pre-tokenization. The current version is 0.2.1.
Warnings
- breaking Starting from version 0.2.0, `sentencepiece` requires Python 3.9 or newer. Users on older Python versions (e.g., 3.8 or below) will encounter installation failures.
- gotcha If a pre-built wheel is not available for your specific Python version, operating system, or CPU architecture, `pip install sentencepiece` will attempt to build from source. This process requires a C++ compiler, CMake, and Python development headers to be installed on your system.
- gotcha Version 0.2.0 of `sentencepiece` had known compatibility issues with other libraries, specifically `transformers` and `tensorflow`, due to a flag redefinition that could lead to Python kernel crashes.
- gotcha Version 0.2.1 introduces experimental free-threading support. While `const` and `static` methods like `encode()` and `decode()` are designed to work without the GIL, non-const methods such as `load()` may have potential data race issues.
- gotcha `spm.SentencePieceTrainer.train()` is optimized for file-based input and expects a raw text file, typically one sentence per line. It can also consume a Python iterable (via the `sentence_iterator` argument), but for large datasets a file path is the standard and most efficient approach; an in-memory iterator is mainly useful in environments with limited local filesystem access.
Install
- pip install sentencepiece
Imports
- SentencePieceProcessor
import sentencepiece as spm
sp = spm.SentencePieceProcessor()
- SentencePieceTrainer
import sentencepiece as spm
spm.SentencePieceTrainer.train(...)
Quickstart
import sentencepiece as spm
import os
# Create a dummy text file for training
corpus_content = "This is a test sentence. SentencePiece is awesome.\nAnother example sentence for training."
corpus_file = "corpus.txt"
with open(corpus_file, "w", encoding="utf-8") as f:
    f.write(corpus_content)
model_prefix = "m_model"
# vocab_size must be achievable from the corpus; 8000 would fail on this
# tiny corpus with "Vocabulary size too high", so use a small value.
vocab_size = 30
# Train a SentencePiece model
spm.SentencePieceTrainer.train(
    input=corpus_file,
    model_prefix=model_prefix,
    vocab_size=vocab_size
)
# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load(f"{model_prefix}.model")
# Encode text
text_to_encode = "SentencePiece tokenization is powerful."
encoded_pieces = sp.encode_as_pieces(text_to_encode)
encoded_ids = sp.encode_as_ids(text_to_encode)
print(f"Original text: {text_to_encode}")
print(f"Encoded pieces: {encoded_pieces}")
print(f"Encoded IDs: {encoded_ids}")
# Decode IDs back to text
decoded_text = sp.decode_ids(encoded_ids)
print(f"Decoded text: {decoded_text}")
# Clean up generated model files
os.remove(f"{model_prefix}.model")
os.remove(f"{model_prefix}.vocab")
os.remove(corpus_file)