{"id":7709,"library":"semantic-text-splitter","title":"Semantic Text Splitter","description":"The `semantic-text-splitter` Python library provides advanced text splitting capabilities by leveraging semantic embeddings to create semantically coherent document chunks. It builds upon `sentence-transformers` and `transformers` to offer both character-based and embedding-based splitting. The current version is 0.29.0, with a relatively frequent release cadence, often introducing new features or refinements every few weeks.","status":"active","version":"0.29.0","language":"en","source_language":"en","source_url":"https://github.com/alejandro-ao/semantic-text-splitter","tags":["text-splitting","nlp","embeddings","vector-databases","rag","chunking","ai"],"install":[{"cmd":"pip install semantic-text-splitter","lang":"bash","label":"Core library"},{"cmd":"pip install semantic-text-splitter[gpu]","lang":"bash","label":"With GPU support (e.g., for faster embeddings via PyTorch CUDA)"}],"dependencies":[{"reason":"Required for tokenization (e.g., AutoTokenizer).","package":"transformers","optional":false},{"reason":"Required for embedding generation.","package":"sentence-transformers","optional":false},{"reason":"Often a dependency of sentence-transformers, required for GPU acceleration.","package":"torch","optional":true}],"imports":[{"symbol":"EmbeddingTextSplitter","correct":"from semantic_text_splitter import EmbeddingTextSplitter"},{"note":"Direct import from the package root is preferred and simpler since v0.14.0.","wrong":"from semantic_text_splitter.splitter import CharacterTextSplitter","symbol":"CharacterTextSplitter","correct":"from semantic_text_splitter import CharacterTextSplitter"},{"note":"AutoTokenizer is from the 'transformers' library, not directly from 'semantic-text-splitter'.","wrong":"from semantic_text_splitter import AutoTokenizer","symbol":"AutoTokenizer","correct":"from transformers import AutoTokenizer"}],"quickstart":{"code":"import os\nfrom 
semantic_text_splitter import EmbeddingTextSplitter\nfrom transformers import AutoTokenizer\n\n# Example text, often a full document\nlong_document_text = (\n    \"The quick brown fox jumps over the lazy dog. \"\n    \"This sentence is a classic example used for typing practice. \"\n    \"However, its semantic content is rather limited. \"\n    \"In natural language processing, we often deal with much longer texts, \"\n    \"requiring sophisticated methods to break them into manageable pieces. \"\n    \"Semantic text splitting aims to keep related ideas together, \"\n    \"even if they are separated by punctuation or line breaks. \"\n    \"This is crucial for retrieval augmented generation (RAG) systems. \"\n    \"By using embeddings, the splitter can understand the meaning of the text \"\n    \"and make informed decisions about where to cut.\"\n    * 5  # Repeat to make it long enough for splitting\n)\n\n# Choose an embedding model (e.g., from Hugging Face Hub)\n# Ensure this model is suitable for your language and task\nmodel_name = \"BAAI/bge-small-en-v1.5\"\n\n# Initialize the tokenizer for the chosen model\n# This is crucial for accurate token counting\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\n# Initialize the EmbeddingTextSplitter\n# 'threshold' sets the similarity cutoff: higher values demand greater\n# similarity within a chunk, tending to produce more, smaller chunks\n# 'max_tokens' defines the maximum size of each chunk\nembedding_splitter = EmbeddingTextSplitter(\n    tokenizer=tokenizer,\n    model_name=model_name,\n    threshold=0.5,\n    max_tokens=256\n)\n\n# Split the document into semantically coherent chunks\nembedding_chunks = embedding_splitter.chunks(long_document_text)\n\nprint(f\"Original text length: {len(long_document_text)} characters\")\nprint(f\"Number of chunks created: {len(embedding_chunks)}\")\nif embedding_chunks:\n    print(f\"First chunk (length {len(embedding_chunks[0])} chars):\\n---\\n{embedding_chunks[0]}\\n---\")\n    print(f\"Last chunk (length {len(embedding_chunks[-1])} 
chars):\\n---\\n{embedding_chunks[-1]}\\n---\")\n\n# Example of CharacterTextSplitter (simpler, non-semantic)\nfrom semantic_text_splitter import CharacterTextSplitter\ncharacter_splitter = CharacterTextSplitter(\n    tokenizer=tokenizer,\n    chunk_size=256,\n    chunk_overlap=30\n)\nchar_chunks = character_splitter.chunks(long_document_text)\nprint(f\"\\nNumber of character chunks created: {len(char_chunks)}\")\nif char_chunks:\n    print(f\"First char chunk (length {len(char_chunks[0])} chars):\\n---\\n{char_chunks[0]}\\n---\")\n","lang":"python","description":"This quickstart demonstrates how to use `EmbeddingTextSplitter` to divide a long document into semantically related chunks using a pre-trained embedding model and its corresponding tokenizer. It also shows `CharacterTextSplitter` for comparison."},"warnings":[{"fix":"Always initialize your `EmbeddingTextSplitter` with `tokenizer=AutoTokenizer.from_pretrained(model_name)` alongside `model_name`.","message":"Prior to version 0.14.0, `EmbeddingTextSplitter` handled tokenizer loading implicitly. Since v0.14.0, a `tokenizer` argument (from `transformers.AutoTokenizer`) is explicitly required during initialization; omitting it is a common source of `TypeError`.","severity":"gotcha","affected_versions":">=0.14.0"},{"fix":"Ensure you are passing a `model_name` compatible with `sentence-transformers` (e.g., from Hugging Face Hub). If using custom embedding logic, review the library's `_get_embeddings` method for current expected inputs.","message":"Version 0.28.0 refactored how embedding models are loaded, primarily relying on `sentence-transformers` for robustness and direct compatibility. 
Custom embedding functions or older integration patterns that bypassed `model_name` might no longer work as expected.","severity":"breaking","affected_versions":">=0.28.0"},{"fix":"Reduce `max_tokens`, process text in smaller batches before passing to the splitter, or use a GPU with more memory. Consider CPU-only or smaller models if memory is a significant constraint.","message":"Processing very large documents or using large `max_tokens` with GPU-enabled embedding models can lead to `RuntimeError: CUDA error: out of memory`, especially on GPUs with limited VRAM.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Experiment with `threshold` values (e.g., 0.3 to 0.7 for common models like BGE) and `max_tokens` on a representative sample of your data to find an optimal balance for your use case.","message":"The `threshold` parameter in `EmbeddingTextSplitter` significantly impacts chunking behavior. A very high `threshold` can lead to many small chunks or even single-sentence chunks, while a very low `threshold` might result in excessively large chunks or no splitting at all, depending on the text's semantic density.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Install the `transformers` library: `pip install transformers`","cause":"The `transformers` library, which provides `AutoTokenizer`, is not installed.","error":"ModuleNotFoundError: No module named 'transformers'"},{"fix":"Install the `sentence-transformers` library: `pip install sentence-transformers`","cause":"The `sentence-transformers` library, which provides the embedding models, is not installed.","error":"ModuleNotFoundError: No module named 'sentence_transformers'"},{"fix":"Ensure the `text` argument passed to the `chunks()` method is a non-empty string containing actual content. 
Add a check `if text.strip():` before splitting.","cause":"You called `splitter.chunks('')` or `splitter.chunks('   ')` with an empty or whitespace-only string.","error":"ValueError: Input text cannot be empty or consists only of whitespace."},{"fix":"Double-check the `model_name` for typos. Verify the model's existence and exact name on Hugging Face Hub (e.g., `https://huggingface.co/BAAI/bge-small-en-v1.5`).","cause":"The `model_name` provided to `AutoTokenizer.from_pretrained()` or `EmbeddingTextSplitter` is incorrect, misspelled, or the model isn't publicly available on Hugging Face Hub.","error":"OSError: Can't load tokenizer for 'some/non-existent-model'. If you were trying to load a tokenizer from a local directory, make sure 'some/non-existent-model' is the correct path to that directory."}]}