SentencePiece

version: 0.2.1 · verified: Tue May 12 · auth: no · python install: draft · quickstart: stale

SentencePiece is an unsupervised text tokenizer and detokenizer, designed primarily for neural network-based text generation systems where the vocabulary size is predetermined. It implements subword algorithms such as Byte-Pair Encoding (BPE) and the Unigram Language Model, and can train directly from raw sentences without pre-tokenization. The library is actively maintained with regular updates; the current version is 0.2.1.

pip install sentencepiece
error ModuleNotFoundError: No module named 'sentencepiece'
cause The `sentencepiece` Python package is not installed in the current environment or is not accessible in the Python path.
fix
Run `pip install sentencepiece` in your terminal to install the library.
error sentencepiece.SentencePieceError: Cannot open file
cause The SentencePiece processor failed to load a model because the specified model file path is incorrect, the file does not exist, or the model file is corrupted.
fix
Verify that the `.model` file exists at the provided path, is accessible, and is a valid SentencePiece model file. Double-check the path spelling and file permissions.
error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xXX in position Y: invalid start byte
cause The input text file being read by SentencePiece (e.g., for training or processing) is not encoded in UTF-8, but SentencePiece expects UTF-8 by default.
fix
Ensure your input text files are saved with UTF-8 encoding. If not possible, read the file with its correct encoding and then process the resulting string, or try specifying the encoding if the SentencePiece method supports it.
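If a corpus was saved in another encoding, a one-off conversion to UTF-8 is usually simpler than fighting the decoder. A minimal sketch, assuming the source encoding is Latin-1 (adjust `src_encoding` to match your data):

```python
def reencode_to_utf8(src, dst, src_encoding="latin-1"):
    """Rewrite a text file as UTF-8 so SentencePiece can read it."""
    # Decode with the file's actual encoding...
    with open(src, "r", encoding=src_encoding) as f:
        text = f.read()
    # ...then write it back out as UTF-8.
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)
```

The converted file at `dst` can then be passed to SentencePiece as usual.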
error sentencepiece.SentencePieceError: Input file is empty.
cause The input file(s) provided to `spm.SentencePieceTrainer.train` for model training are empty, do not exist, or their paths are incorrect, resulting in no data for training.
fix
Confirm that the file(s) specified in the `input` argument of `SentencePieceTrainer.train()` contain valid text data and that their paths are correct and accessible.
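A small pre-flight check can turn the opaque trainer error into a clearer one before training starts. A sketch using only the standard library (`validate_corpus` is a hypothetical helper, not part of the sentencepiece API):

```python
import os

def validate_corpus(paths):
    """Fail early, with a clearer message, before SentencePieceTrainer.train."""
    for p in paths:
        if not os.path.isfile(p):
            raise FileNotFoundError(f"training input not found: {p}")
        if os.path.getsize(p) == 0:
            raise ValueError(f"training input is empty: {p}")
```

Call it on the same list of paths you intend to pass as the `input` argument.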
breaking Starting from version 0.2.0, `sentencepiece` requires Python 3.9 or newer. Users on older Python versions (e.g., 3.8 or below) will encounter installation failures.
fix Upgrade your Python environment to 3.9 or later, or pin `sentencepiece` to `0.1.99` or an earlier compatible version.
gotcha If a pre-built wheel is not available for your specific Python version, operating system, or CPU architecture, `pip install sentencepiece` will attempt to build from source. This process requires a C++ compiler, CMake, and Python development headers to be installed on your system.
fix Ensure that necessary build tools (like `cmake`, C++ compiler) and Python development headers are installed for your environment if `pip install` fails. It's often easier to use a Python version for which pre-built wheels are readily available.
gotcha Version 0.2.0 of `sentencepiece` had known compatibility issues with other libraries, specifically `transformers` and `tensorflow`, due to a flag redefinition that could lead to Python kernel crashes.
fix Users encountering this issue with `v0.2.0` should upgrade to `v0.2.1` or a newer version, as the fix has been merged.
gotcha Version 0.2.1 introduces experimental free-threading support. While `const` and `static` methods like `encode()` and `decode()` are designed to work without the GIL, non-const methods such as `load()` remain subject to data races.
fix If using `sentencepiece` in a free-threaded environment and calling non-const methods like `load()`, ensure appropriate explicit locks are implemented to prevent data races.
gotcha When training a SentencePiece model, `spm.SentencePieceTrainer.train()` is optimized for file-based input: a raw text file, typically one sentence per line. It can also accept an iterable, but for large datasets passing a file path (or a file-like object where local filesystem access is limited) is the standard and most efficient approach.
fix Prepare your training data in a plain text file, with one sentence per line, and pass the file path to the `input` argument of `SentencePieceTrainer.train()`.
breaking When training a SentencePiece model, if the input corpus is extremely small or has very limited unique characters/sequences, requesting a `vocab_size` that exceeds the maximum possible vocabulary derivable from the data can lead to a `RuntimeError` stating 'Vocabulary size too high' with a specific upper limit.
fix Ensure your training corpus is sufficiently large and diverse to support the desired `vocab_size`. If the corpus is intentionally small, reduce the `vocab_size` parameter in `SentencePieceTrainer.train()` to a value less than or equal to the maximum allowed size specified in the error message (e.g., `<= 33` in this case).
python  os / libc      status       install  import  disk
3.9     alpine (musl)  build_error  -        -       -
3.9     alpine (musl)  -            -        -       -
3.9     slim (glibc)   wheel        1.9s     0.04s   21M
3.9     slim (glibc)   -            -        0.04s   21M
3.10    alpine (musl)  build_error  -        -       -
3.10    alpine (musl)  -            -        -       -
3.10    slim (glibc)   wheel        1.6s     0.04s   22M
3.10    slim (glibc)   -            -        0.03s   22M
3.11    alpine (musl)  build_error  -        -       -
3.11    alpine (musl)  -            -        -       -
3.11    slim (glibc)   wheel        1.8s     0.06s   24M
3.11    slim (glibc)   -            -        0.07s   24M
3.12    alpine (musl)  build_error  -        -       -
3.12    alpine (musl)  -            -        -       -
3.12    slim (glibc)   wheel        1.5s     0.09s   15M
3.12    slim (glibc)   -            -        0.09s   15M
3.13    alpine (musl)  build_error  -        -       -
3.13    alpine (musl)  -            -        -       -
3.13    slim (glibc)   wheel        1.5s     0.08s   15M
3.13    slim (glibc)   -            -        0.08s   15M

This quickstart demonstrates how to train a SentencePiece model from a text file, load the trained model, and then use it to encode text into subword pieces and IDs, and decode IDs back to text. The `input` parameter for training expects a file path.

import sentencepiece as spm
import os

# Create a dummy text file for training
corpus_content = "This is a test sentence. SentencePiece is awesome.\nAnother example sentence for training."
corpus_file = "corpus.txt"
with open(corpus_file, "w", encoding="utf-8") as f:
    f.write(corpus_content)

model_prefix = "m_model"
# The tiny corpus above supports only a very small vocabulary; a large
# value such as 8000 would fail with "Vocabulary size too high".
vocab_size = 30

# Train a SentencePiece model
spm.SentencePieceTrainer.train(
    input=corpus_file,
    model_prefix=model_prefix,
    vocab_size=vocab_size
)

# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load(f"{model_prefix}.model")

# Encode text
text_to_encode = "SentencePiece tokenization is powerful."
encoded_pieces = sp.encode_as_pieces(text_to_encode)
encoded_ids = sp.encode_as_ids(text_to_encode)

print(f"Original text: {text_to_encode}")
print(f"Encoded pieces: {encoded_pieces}")
print(f"Encoded IDs: {encoded_ids}")

# Decode IDs back to text
decoded_text = sp.decode_ids(encoded_ids)
print(f"Decoded text: {decoded_text}")

# Clean up generated model files
os.remove(f"{model_prefix}.model")
os.remove(f"{model_prefix}.vocab")
os.remove(corpus_file)