SentencePiece

0.2.1 · active · verified Sat Mar 28

SentencePiece is an unsupervised text tokenizer and detokenizer, designed primarily for neural network-based text generation systems where the vocabulary size is fixed in advance. It implements subword algorithms such as Byte-Pair Encoding (BPE) and the Unigram Language Model, and can train directly from raw sentences without pre-tokenization.

Warnings

Install
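The Python bindings are published on PyPI, so a standard pip install is sufficient:

```shell
pip install sentencepiece
```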

Imports

Quickstart

This quickstart demonstrates how to train a SentencePiece model from a text file, load the trained model, and then use it to encode text into subword pieces and IDs, and decode IDs back to text. The `input` parameter for training expects a file path.

import sentencepiece as spm
import os

# Create a dummy text file for training
corpus_content = "This is a test sentence. SentencePiece is awesome.\nAnother example sentence for training."
corpus_file = "corpus.txt"
with open(corpus_file, "w", encoding="utf-8") as f:
    f.write(corpus_content)

model_prefix = "m_model"
# The toy corpus above is far too small for a realistic vocabulary such as
# 8000 or 32000; the trainer raises an error if vocab_size exceeds what the
# corpus can support, so a small value is used here.
vocab_size = 50

# Train a SentencePiece model
spm.SentencePieceTrainer.train(
    input=corpus_file,
    model_prefix=model_prefix,
    vocab_size=vocab_size
)

# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load(f"{model_prefix}.model")

# Encode text
text_to_encode = "SentencePiece tokenization is powerful."
encoded_pieces = sp.encode_as_pieces(text_to_encode)
encoded_ids = sp.encode_as_ids(text_to_encode)

print(f"Original text: {text_to_encode}")
print(f"Encoded pieces: {encoded_pieces}")
print(f"Encoded IDs: {encoded_ids}")

# Decode IDs back to text
decoded_text = sp.decode_ids(encoded_ids)
print(f"Decoded text: {decoded_text}")

# Clean up generated model files
os.remove(f"{model_prefix}.model")
os.remove(f"{model_prefix}.vocab")
os.remove(corpus_file)
