{"library":"semantic-text-splitter","title":"Semantic Text Splitter","description":"The `semantic-text-splitter` Python library splits text into semantically coherent chunks using semantic embeddings. It builds upon `sentence-transformers` and `transformers` to offer both character-based and embedding-based splitting. The current version is 0.29.0; releases are frequent, typically adding features or refinements every few weeks.","language":"python","status":"active","last_verified":"2026-05-17","install":{"commands":["pip install semantic-text-splitter","pip install semantic-text-splitter[gpu]"],"cli":null},"imports":["from semantic_text_splitter import EmbeddingTextSplitter","from semantic_text_splitter import CharacterTextSplitter","from transformers import AutoTokenizer"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"from semantic_text_splitter import EmbeddingTextSplitter\nfrom transformers import AutoTokenizer\n\n# Example text, often a full document\nlong_document_text = (\n    \"The quick brown fox jumps over the lazy dog. \"\n    \"This sentence is a classic example used for typing practice. \"\n    \"However, its semantic content is rather limited. \"\n    \"In natural language processing, we often deal with much longer texts, \"\n    \"requiring sophisticated methods to break them into manageable pieces. \"\n    \"Semantic text splitting aims to keep related ideas together, \"\n    \"even if they are separated by punctuation or line breaks. \"\n    \"This is crucial for retrieval augmented generation (RAG) systems. 
\"\n    \"By using embeddings, the splitter can understand the meaning of the text \"\n    \"and make informed decisions about where to cut.\" \n    * 5 # Repeat to make it long enough for splitting\n)\n\n# Choose an embedding model (e.g., from Hugging Face Hub)\n# Ensure this model is suitable for your language and task\nmodel_name = \"BAAI/bge-small-en-v1.5\"\n\n# Initialize the tokenizer for the chosen model\n# This is crucial for accurate token counting\ntokenizer = AutoTokenizer.from_pretrained(model_name)\n\n# Initialize the EmbeddingTextSplitter\n# 'threshold' controls semantic similarity: higher = more similar chunks\n# 'max_tokens' defines the maximum size of each chunk\nembedding_splitter = EmbeddingTextSplitter(\n    tokenizer=tokenizer,\n    model_name=model_name,\n    threshold=0.5,\n    max_tokens=256\n)\n\n# Split the document into semantically coherent chunks\nembedding_chunks = embedding_splitter.chunks(long_document_text)\n\nprint(f\"Original text length: {len(long_document_text)} characters\")\nprint(f\"Number of chunks created: {len(embedding_chunks)}\")\nif embedding_chunks:\n    print(f\"First chunk (length {len(embedding_chunks[0])} chars):\\n---\\n{embedding_chunks[0]}\\n---\")\n    print(f\"Last chunk (length {len(embedding_chunks[-1])} chars):\\n---\\n{embedding_chunks[-1]}\\n---\")\n\n# Example of CharacterTextSplitter (simpler, non-semantic)\nfrom semantic_text_splitter import CharacterTextSplitter\ncharacter_splitter = CharacterTextSplitter(\n    tokenizer=tokenizer, \n    chunk_size=256, \n    chunk_overlap=30\n)\nchar_chunks = character_splitter.chunks(long_document_text)\nprint(f\"\\nNumber of character chunks created: {len(char_chunks)}\")\nif char_chunks:\n    print(f\"First char chunk (length {len(char_chunks[0])} chars):\\n---\\n{char_chunks[0]}\\n---\")\n","lang":"python","description":"This quickstart demonstrates how to use `EmbeddingTextSplitter` to divide a long document into semantically related chunks using a pre-trained 
embedding model and its corresponding tokenizer. It also shows `CharacterTextSplitter` for comparison.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-17","installed_version":"0.27.0","pypi_latest":"0.30.1","is_stale":true,"summary":{"python_range":"3.9–3.13","success_rate":50,"avg_install_s":1.9,"avg_import_s":null,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"semantic-text-splitter","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"gpu","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"semantic-text-splitter","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":"37M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"gpu","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.9,"import_time_s":null,"mem_mb":null,"disk_size":"37M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"semantic-text-splitter","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine 
(musl)","variant":"gpu","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"semantic-text-splitter","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":"38M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"gpu","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.9,"import_time_s":null,"mem_mb":null,"disk_size":"38M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"semantic-text-splitter","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"gpu","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"semantic-text-splitter","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.6,"import_time_s":null,"mem_mb":null,"disk_size":"30M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"gpu","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":"30M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine 
(musl)","variant":"semantic-text-splitter","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"gpu","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"semantic-text-splitter","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.6,"import_time_s":null,"mem_mb":null,"disk_size":"30M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"gpu","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":1.8,"import_time_s":null,"mem_mb":null,"disk_size":"30M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"semantic-text-splitter","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"gpu","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"semantic-text-splitter","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":2.3,"import_time_s":null,"mem_mb":null,"disk_size":"36M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim 
(glibc)","variant":"gpu","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":2.2,"import_time_s":null,"mem_mb":null,"disk_size":"36M"}]}}