Semantic Text Splitter

0.29.0 · active · verified Thu Apr 16

The `semantic-text-splitter` Python library splits documents into semantically coherent chunks, using embedding similarity to decide where one chunk should end and the next begin. It builds on `sentence-transformers` and `transformers` and offers both character-based and embedding-based splitting. The current version is 0.29.0; releases are relatively frequent, often adding new features or refinements every few weeks.

Common errors

Warnings

Install

pip install semantic-text-splitter

Imports

from semantic_text_splitter import CharacterTextSplitter, EmbeddingTextSplitter
from transformers import AutoTokenizer

Quickstart

This quickstart demonstrates how to use `EmbeddingTextSplitter` to divide a long document into semantically related chunks using a pre-trained embedding model and its corresponding tokenizer. It also shows `CharacterTextSplitter` for comparison.

from semantic_text_splitter import EmbeddingTextSplitter
from transformers import AutoTokenizer

# Example text, often a full document
long_document_text = (
    "The quick brown fox jumps over the lazy dog. "
    "This sentence is a classic example used for typing practice. "
    "However, its semantic content is rather limited. "
    "In natural language processing, we often deal with much longer texts, "
    "requiring sophisticated methods to break them into manageable pieces. "
    "Semantic text splitting aims to keep related ideas together, "
    "even if they are separated by punctuation or line breaks. "
    "This is crucial for retrieval augmented generation (RAG) systems. "
    "By using embeddings, the splitter can understand the meaning of the text "
    "and make informed decisions about where to cut." 
    * 5 # Repeat to make it long enough for splitting
)

# Choose an embedding model (e.g., from Hugging Face Hub)
# Ensure this model is suitable for your language and task
model_name = "BAAI/bge-small-en-v1.5"

# Initialize the tokenizer for the chosen model
# This is crucial for accurate token counting
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Initialize the EmbeddingTextSplitter
# 'threshold' controls semantic similarity: higher = more similar chunks
# 'max_tokens' defines the maximum size of each chunk
embedding_splitter = EmbeddingTextSplitter(
    tokenizer=tokenizer,
    model_name=model_name,
    threshold=0.5,
    max_tokens=256
)

# Split the document into semantically coherent chunks
embedding_chunks = embedding_splitter.chunks(long_document_text)

print(f"Original text length: {len(long_document_text)} characters")
print(f"Number of chunks created: {len(embedding_chunks)}")
if embedding_chunks:
    print(f"First chunk (length {len(embedding_chunks[0])} chars):\n---\n{embedding_chunks[0]}\n---")
    print(f"Last chunk (length {len(embedding_chunks[-1])} chars):\n---\n{embedding_chunks[-1]}\n---")

# Example of CharacterTextSplitter (simpler, non-semantic)
from semantic_text_splitter import CharacterTextSplitter
character_splitter = CharacterTextSplitter(
    tokenizer=tokenizer,
    chunk_size=256,
    chunk_overlap=30
)
char_chunks = character_splitter.chunks(long_document_text)
print(f"\nNumber of character chunks created: {len(char_chunks)}")
if char_chunks:
    print(f"First char chunk (length {len(char_chunks[0])} chars):\n---\n{char_chunks[0]}\n---")

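The `threshold` argument is easiest to understand as a cut rule over adjacent sentence similarities: a new chunk begins wherever the cosine similarity between neighboring sentence embeddings drops below the threshold. The sketch below illustrates that decision rule in plain Python with toy 2-D vectors standing in for real embeddings; `cosine` and `split_on_threshold` are illustrative helpers written for this example, not part of the library's API.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def split_on_threshold(sentences, embeddings, threshold):
    # Greedy rule: start a new chunk whenever similarity to the
    # previous sentence falls below the threshold.
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks

# Toy 2-D "embeddings": the first two sentences point one way, the last two another.
sentences = [
    "The quick brown fox jumps.",
    "The lazy dog does not move.",
    "RAG systems retrieve chunks.",
    "Embeddings guide the cuts.",
]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

print(split_on_threshold(sentences, embeddings, threshold=0.5))
# Two chunks: the first two sentences stay together, as do the last two.
```

A higher threshold (closer to 1.0) means even small topic shifts trigger a cut, producing more, smaller chunks; per the quickstart, `max_tokens` then caps any chunk that the similarity rule alone would leave too large.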