{"id":707,"library":"sentencepiece","title":"SentencePiece","description":"SentencePiece is an unsupervised text tokenizer and detokenizer, primarily designed for Neural Network-based text generation systems where the vocabulary size is predetermined. It implements subword units like Byte-Pair Encoding (BPE) and Unigram Language Model, capable of training directly from raw sentences without pre-tokenization. The library is actively maintained with regular updates. The current version is 0.2.1.","status":"active","version":"0.2.1","language":"python","source_language":"en","source_url":"https://github.com/google/sentencepiece","tags":["nlp","tokenization","subword","machine-learning"],"install":[{"cmd":"pip install sentencepiece","lang":"bash","label":"Standard installation"}],"dependencies":[],"imports":[{"symbol":"SentencePieceProcessor","correct":"import sentencepiece as spm\nsp = spm.SentencePieceProcessor()"},{"symbol":"SentencePieceTrainer","correct":"import sentencepiece as spm\nspm.SentencePieceTrainer.train(...)"}],"quickstart":{"code":"import sentencepiece as spm\nimport os\n\n# Create a dummy text file for training\ncorpus_content = \"This is a test sentence. 
SentencePiece is awesome.\\nAnother example sentence for training.\"\ncorpus_file = \"corpus.txt\"\nwith open(corpus_file, \"w\", encoding=\"utf-8\") as f:\n    f.write(corpus_content)\n\nmodel_prefix = \"m_model\"\nvocab_size = 8000\n\n# Train a SentencePiece model.\n# The toy corpus cannot yield 8000 pieces, so treat vocab_size as a soft\n# limit; otherwise training raises 'Vocabulary size too high'.\nspm.SentencePieceTrainer.train(\n    input=corpus_file,\n    model_prefix=model_prefix,\n    vocab_size=vocab_size,\n    hard_vocab_limit=False\n)\n\n# Load the trained model\nsp = spm.SentencePieceProcessor()\nsp.load(f\"{model_prefix}.model\")\n\n# Encode text\ntext_to_encode = \"SentencePiece tokenization is powerful.\"\nencoded_pieces = sp.encode_as_pieces(text_to_encode)\nencoded_ids = sp.encode_as_ids(text_to_encode)\n\nprint(f\"Original text: {text_to_encode}\")\nprint(f\"Encoded pieces: {encoded_pieces}\")\nprint(f\"Encoded IDs: {encoded_ids}\")\n\n# Decode IDs back to text\ndecoded_text = sp.decode_ids(encoded_ids)\nprint(f\"Decoded text: {decoded_text}\")\n\n# Clean up generated model files\nos.remove(f\"{model_prefix}.model\")\nos.remove(f\"{model_prefix}.vocab\")\nos.remove(corpus_file)","lang":"python","description":"This quickstart demonstrates how to train a SentencePiece model from a text file, load the trained model, and then use it to encode text into subword pieces and IDs, and decode IDs back to text. The `input` parameter for training expects a file path. Because the toy corpus is tiny, `hard_vocab_limit=False` lets training finish with fewer pieces than the requested `vocab_size`."},"warnings":[{"fix":"Upgrade your Python environment to 3.9 or later, or pin `sentencepiece` to `0.1.99` or an earlier compatible version.","message":"Starting from version 0.2.0, `sentencepiece` requires Python 3.9 or newer. Users on older Python versions (e.g., 3.8 or below) will encounter installation failures.","severity":"breaking","affected_versions":">=0.2.0"},{"fix":"Ensure that necessary build tools (like `cmake`, C++ compiler) and Python development headers are installed for your environment if `pip install` fails. 
It's often easier to use a Python version for which pre-built wheels are readily available.","message":"If a pre-built wheel is not available for your specific Python version, operating system, or CPU architecture, `pip install sentencepiece` will attempt to build from source. This process requires a C++ compiler, CMake, and Python development headers to be installed on your system.","severity":"gotcha","affected_versions":"all"},{"fix":"Users encountering this issue with `v0.2.0` should upgrade to `v0.2.1` or a newer version, as the fix has been merged.","message":"Version 0.2.0 of `sentencepiece` had known compatibility issues with other libraries, specifically `transformers` and `tensorflow`, due to a flag redefinition that could lead to Python kernel crashes.","severity":"gotcha","affected_versions":"0.2.0"},{"fix":"If using `sentencepiece` in a free-threaded environment and calling non-const methods like `load()`, ensure appropriate explicit locks are implemented to prevent data races.","message":"Version 0.2.1 introduces experimental free-threading support. While `const` and `static` methods like `encode()` and `decode()` are designed to work without the GIL, non-const methods such as `load()` may have potential data race issues.","severity":"gotcha","affected_versions":">=0.2.1"},{"fix":"Prepare your training data in a plain text file, with one sentence per line, and pass the file path to the `input` argument of `SentencePieceTrainer.train()`.","message":"When training a SentencePiece model, the `spm.SentencePieceTrainer.train()` method is optimized for file-based input, expecting a raw text file (typically one sentence per line). 
While it can accept an iterable, for large datasets, providing a file path (or a file-like object in environments with limited local filesystem access) is the standard and most efficient approach.","severity":"gotcha","affected_versions":"all"},{"fix":"Ensure your training corpus is sufficiently large and diverse to support the desired `vocab_size`. If the corpus is intentionally small, reduce the `vocab_size` parameter in `SentencePieceTrainer.train()` to a value less than or equal to the maximum allowed size specified in the error message (e.g., `<= 33` in this case).","message":"When training a SentencePiece model, if the input corpus is extremely small or has very limited unique characters/sequences, requesting a `vocab_size` that exceeds the maximum possible vocabulary derivable from the data can lead to a `RuntimeError` stating 'Vocabulary size too high' with a specific upper limit.","severity":"breaking","affected_versions":"all"}],"env_vars":null,"last_verified":"2026-05-12T18:02:28.454Z","next_check":"2026-06-26T00:00:00.000Z","problems":[{"fix":"Run `pip install sentencepiece` in your terminal to install the library.","cause":"The `sentencepiece` Python package is not installed in the current environment or is not accessible in the Python path.","error":"ModuleNotFoundError: No module named 'sentencepiece'"},{"fix":"Verify that the `.model` file exists at the provided path, is accessible, and is a valid SentencePiece model file. Double-check the path spelling and file permissions.","cause":"The SentencePiece processor failed to load a model because the specified model file path is incorrect, the file does not exist, or the model file is corrupted.","error":"sentencepiece.SentencePieceError: Cannot open file"},{"fix":"Ensure your input text files are saved with UTF-8 encoding. 
If the source files cannot be changed, open them in Python with their actual encoding and pass the decoded strings to SentencePiece, or write a UTF-8 copy to use for training.","cause":"The input text file being read by SentencePiece (e.g., for training or processing) is not encoded in UTF-8, but SentencePiece expects UTF-8.","error":"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xXX in position Y: invalid start byte"},{"fix":"Confirm that the input file(s) specified in the `input` argument for `SentencePieceTrainer.train` contain valid text data and that their paths are correct and accessible.","cause":"The input file(s) provided to `spm.SentencePieceTrainer.train` for model training are empty, do not exist, or their paths are incorrect, resulting in no data for training.","error":"sentencepiece.SentencePieceError: Input file is empty."}],"ecosystem":"pypi","meta_description":null,"install_score":50,"install_tag":"draft","quickstart_score":0,"quickstart_tag":"stale","pypi_latest":"0.2.1","install_checks":{"last_tested":"2026-05-12","tag":"draft","tag_description":"notable install failures or slow imports","results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":1.6,"import_time_s":0.04,"mem_mb":2.5,"disk_size":"22M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim 
(glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.03,"mem_mb":2.5,"disk_size":"22M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":1.8,"import_time_s":0.06,"mem_mb":2.8,"disk_size":"24M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.07,"mem_mb":2.8,"disk_size":"24M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":1.5,"import_time_s":0.09,"mem_mb":3.5,"disk_size":"15M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim 
(glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.09,"mem_mb":3.5,"disk_size":"15M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":1.5,"import_time_s":0.08,"mem_mb":3.3,"disk_size":"15M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.08,"mem_mb":3.2,"disk_size":"15M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":1.9,"import_time_s":0.04,"mem_mb":2.4,"disk_size":"21M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim 
(glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.04,"mem_mb":2.4,"disk_size":"21M"}]},"quickstart_checks":{"last_tested":"2026-04-24","tag":"stale","tag_description":"widespread failures or data too old to trust","results":[{"runtime":"python:3.10-alpine","exit_code":1},{"runtime":"python:3.10-slim","exit_code":1},{"runtime":"python:3.11-alpine","exit_code":1},{"runtime":"python:3.11-slim","exit_code":1},{"runtime":"python:3.12-alpine","exit_code":1},{"runtime":"python:3.12-slim","exit_code":1},{"runtime":"python:3.13-alpine","exit_code":1},{"runtime":"python:3.13-slim","exit_code":1},{"runtime":"python:3.9-alpine","exit_code":1},{"runtime":"python:3.9-slim","exit_code":1}]}}