Semantic Text Splitter
The `semantic-text-splitter` Python library splits documents into semantically coherent chunks by leveraging embeddings. It builds on `sentence-transformers` and `transformers` to offer both character-based and embedding-based splitting. The current version is 0.29.0, with a relatively frequent release cadence that often introduces new features or refinements every few weeks.
Common errors
- `ModuleNotFoundError: No module named 'transformers'`
  - cause: The `transformers` library, which provides `AutoTokenizer`, is not installed.
  - fix: Install it: `pip install transformers`
- `ModuleNotFoundError: No module named 'sentence_transformers'`
  - cause: The `sentence-transformers` library, which provides the embedding models, is not installed.
  - fix: Install it: `pip install sentence-transformers`
- `ValueError: Input text cannot be empty or consists only of whitespace.`
  - cause: `chunks()` was called with an empty or whitespace-only string, e.g. `splitter.chunks('')` or `splitter.chunks(' ')`.
  - fix: Ensure the `text` argument passed to `chunks()` is a non-empty string containing actual content. Add a check `if text.strip():` before splitting.
- `OSError: Can't load tokenizer for 'some/non-existent-model'. If you were trying to load a tokenizer from a local directory, make sure 'some/non-existent-model' is the correct path to that directory.`
  - cause: The `model_name` provided to `AutoTokenizer.from_pretrained()` or `EmbeddingTextSplitter` is incorrect, misspelled, or the model isn't publicly available on the Hugging Face Hub.
  - fix: Double-check the `model_name` for typos, and verify the model's existence and exact name on the Hugging Face Hub (e.g. `https://huggingface.co/BAAI/bge-small-en-v1.5`).
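The empty-input `ValueError` above is easy to guard against. The sketch below is a hypothetical helper (the name `safe_chunks` is not part of the library); it works with any object exposing the `chunks()` method this document describes:

```python
def safe_chunks(splitter, text):
    """Return splitter.chunks(text), or [] for empty/whitespace-only input.

    Guards against the ValueError raised when chunks() receives an
    empty or whitespace-only string.
    """
    if not text or not text.strip():
        return []
    return splitter.chunks(text)
```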
Warnings
- gotcha Prior to version 0.14.0, `EmbeddingTextSplitter` may have handled tokenizer loading implicitly. Since v0.14.0, a `tokenizer` argument (a `transformers.AutoTokenizer` instance) is explicitly required at initialization; omitting it is a common source of `TypeError`.
- breaking Version 0.28.0 refactored how embedding models are loaded, primarily relying on `sentence-transformers` for robustness and direct compatibility. Custom embedding functions or older integration patterns that bypassed `model_name` might no longer work as expected.
- gotcha Processing very large documents or using large `max_tokens` with GPU-enabled embedding models can lead to `RuntimeError: CUDA error: out of memory`. This is especially true if running on a limited VRAM GPU.
- gotcha The `threshold` parameter in `EmbeddingTextSplitter` significantly impacts chunking behavior. A very high `threshold` can lead to many small chunks or even single-sentence chunks, while a very low `threshold` might result in excessively large chunks or not splitting at all, depending on the text's semantic density.
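To build intuition for the `threshold` gotcha above, here is a toy sketch (not the library's actual algorithm): each value stands for the similarity between two adjacent sentences, and a boundary becomes a split point when its similarity falls below the threshold.

```python
def split_points(similarities, threshold):
    # A boundary between adjacent sentences becomes a chunk boundary
    # when their similarity falls below the threshold.
    return [i for i, sim in enumerate(similarities) if sim < threshold]

sims = [0.9, 0.3, 0.8, 0.2, 0.85]
print(split_points(sims, 0.95))  # high threshold: splits almost everywhere -> many small chunks
print(split_points(sims, 0.25))  # low threshold: rarely splits -> large chunks
```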
Install
- `pip install semantic-text-splitter`
- `pip install semantic-text-splitter[gpu]`
Imports
- EmbeddingTextSplitter
from semantic_text_splitter import EmbeddingTextSplitter
- CharacterTextSplitter
from semantic_text_splitter import CharacterTextSplitter
- AutoTokenizer
from transformers import AutoTokenizer
Quickstart
from semantic_text_splitter import EmbeddingTextSplitter
from transformers import AutoTokenizer
# Example text, often a full document
long_document_text = (
    "The quick brown fox jumps over the lazy dog. "
    "This sentence is a classic example used for typing practice. "
    "However, its semantic content is rather limited. "
    "In natural language processing, we often deal with much longer texts, "
    "requiring sophisticated methods to break them into manageable pieces. "
    "Semantic text splitting aims to keep related ideas together, "
    "even if they are separated by punctuation or line breaks. "
    "This is crucial for retrieval augmented generation (RAG) systems. "
    "By using embeddings, the splitter can understand the meaning of the text "
    "and make informed decisions about where to cut."
    * 5  # repeat to make it long enough for splitting (adjacent literals join first)
)
# Choose an embedding model (e.g., from Hugging Face Hub)
# Ensure this model is suitable for your language and task
model_name = "BAAI/bge-small-en-v1.5"
# Initialize the tokenizer for the chosen model
# This is crucial for accurate token counting
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Initialize the EmbeddingTextSplitter
# 'threshold' controls semantic similarity: higher = more similar chunks
# 'max_tokens' defines the maximum size of each chunk
embedding_splitter = EmbeddingTextSplitter(
    tokenizer=tokenizer,
    model_name=model_name,
    threshold=0.5,
    max_tokens=256,
)
# Split the document into semantically coherent chunks
embedding_chunks = embedding_splitter.chunks(long_document_text)
print(f"Original text length: {len(long_document_text)} characters")
print(f"Number of chunks created: {len(embedding_chunks)}")
if embedding_chunks:
    print(f"First chunk (length {len(embedding_chunks[0])} chars):\n---\n{embedding_chunks[0]}\n---")
    print(f"Last chunk (length {len(embedding_chunks[-1])} chars):\n---\n{embedding_chunks[-1]}\n---")
# Example of CharacterTextSplitter (simpler, non-semantic)
from semantic_text_splitter import CharacterTextSplitter
character_splitter = CharacterTextSplitter(
    tokenizer=tokenizer,
    chunk_size=256,
    chunk_overlap=30,
)
char_chunks = character_splitter.chunks(long_document_text)
print(f"\nNumber of character chunks created: {len(char_chunks)}")
if char_chunks:
    print(f"First char chunk (length {len(char_chunks[0])} chars):\n---\n{char_chunks[0]}\n---")
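For intuition about what the embedding-based splitter in the quickstart is doing conceptually, here is a self-contained sketch that applies a cosine-similarity threshold to toy sentence vectors. It mirrors the `threshold` idea described above but is not the library's implementation; sentences, vectors, and function names are illustrative only.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def toy_semantic_split(sentences, embeddings, threshold):
    # Start a new chunk whenever two consecutive sentence embeddings
    # are less similar than the threshold.
    chunks, current = [], [sentences[0]]
    for prev_vec, cur_vec, sentence in zip(embeddings, embeddings[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sentences = ["Cats purr.", "Kittens meow.", "Stocks fell today."]
vectors = [[1.0, 0.0], [0.95, 0.1], [0.0, 1.0]]  # toy 2-d "embeddings"
print(toy_semantic_split(sentences, vectors, threshold=0.5))
# The two cat sentences stay together; the unrelated one becomes its own chunk.
```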