{"id":3271,"library":"semchunk","title":"semchunk","description":"semchunk is a Python library for splitting text into smaller chunks while preserving as much local semantic context as possible. It supports advanced features like AI-powered hierarchical chunking, chunk overlapping, and processing Isaacus Legal Graph Schema (ILGS) Documents, working seamlessly with various tokenizers. Actively developed by Isaacus, the library has frequent releases, with version 4.0.0 notably introducing AI chunking and ILGS Document support.","status":"active","version":"4.0.0","language":"en","source_language":"en","source_url":"https://github.com/isaacus-dev/semchunk","tags":["text chunking","nlp","semantic chunking","ai","isaacus","rag","tokenization"],"install":[{"cmd":"pip install semchunk","lang":"bash","label":"Install `semchunk`"}],"dependencies":[{"reason":"Required for AI-powered chunking mode and processing Isaacus Legal Graph Schema (ILGS) Documents.","package":"isaacus","optional":true}],"imports":[{"symbol":"chunkerify","correct":"import semchunk\nchunker = semchunk.chunkerify(...)"}],"quickstart":{"code":"import semchunk\n\n# You can optionally import transformers or tiktoken for specific tokenizers,\n# but they are not direct dependencies of semchunk itself.\n# from transformers import AutoTokenizer\n# import tiktoken\n\nchunk_size = 4 # A low chunk size is used here for demonstration purposes.\n# Keep in mind, `semchunk` does not know how many special tokens, if any,\n# your tokenizer adds to every input, so you may want to deduct the number\n# of special tokens added from your chunk size.\n\ntext = 'The quick brown fox jumps over the lazy dog.'\n\n# `chunkerify` accepts the name of an OpenAI model, Tiktoken encoding, Hugging Face model,\n# or a custom tokenizer/token counter.\nchunker = semchunk.chunkerify('gpt-4', chunk_size) # Using an OpenAI model name\n# Example with a Hugging Face tokenizer (requires `transformers` to be installed):\n# from transformers import AutoTokenizer\n# chunker = semchunk.chunkerify(AutoTokenizer.from_pretrained('bert-base-uncased'), chunk_size)\n\nchunks = chunker(text)\nprint(chunks)\n# Expected output might vary slightly based on the tokenizer and chunk_size,\n# but will be similar to: ['The quick brown', 'fox jumps over', 'the lazy dog.']","lang":"python","description":"Demonstrates how to initialize a `semchunk` chunker with a specified tokenizer (e.g., by model name) and a maximum `chunk_size`, then use the returned callable chunker to split a given text into semantically meaningful segments. This example highlights the use of model names for easy tokenizer integration."},"warnings":[{"fix":"Manually calculate and deduct the special token count from your `chunk_size` parameter, or use the `tokenizer_kwargs` argument in `chunkerify()` (available since v4.0.0) to explicitly control tokenizer behavior regarding special tokens.","message":"When specifying `chunk_size`, be aware that `semchunk` does not automatically account for special tokens added by your tokenizer. You should typically deduct the number of special tokens from your desired `chunk_size` to ensure chunks do not exceed the model's actual context window. This critical guidance was removed in v3.0.0 but re-added in v3.1.1 due to its importance for correct usage.","severity":"gotcha","affected_versions":">=3.1.1 (clarification), v3.0.0 (where it was missing)"},{"fix":"Update calls to `semchunk.chunkerify(...)` and the returned `chunker(...)` to use keyword arguments for all parameters after `tokenizer_or_token_counter` and `chunk_size` (for `chunkerify`), and after `text`, `chunk_size`, and `token_counter` (for the returned chunker).","message":"As of version 4.0.0, all arguments to `semchunk.chunkerify()` and the callable `chunker` it returns, except for the first two and first three arguments respectively, are now keyword-only. Passing these arguments positionally will raise a `TypeError`.","severity":"breaking","affected_versions":">=4.0.0"},{"fix":"Adjust `chunk_size` expectations or use the new `tokenizer_kwargs` argument in `chunkerify()` to explicitly control how special tokens are handled by the underlying tokenizer if the new default behavior is not desired.","message":"In version 4.0.0, `semchunk` changed its default behavior for handling special tokens when using `tiktoken` or `transformers` tokenizers: it now treats special tokens as normal text. Previously, `tiktoken` would raise an error, and `transformers` would treat them as special tokens. This can alter token counts and chunking behavior for texts containing special tokens.","severity":"breaking","affected_versions":">=4.0.0"},{"fix":"Review your chunking outputs with `semchunk` versions 3.2.0 and later if your application relies on specific chunk boundaries, especially for texts with complex whitespace or when using low chunk sizes.","message":"Version 3.2.0 introduced a significant improvement in chunk quality, particularly for low chunk sizes or documents with minimal whitespace, by prioritizing more semantically meaningful split points. Version 3.2.4 also fixed the splitter sorting order. While an improvement, these changes mean the exact chunk boundaries may differ from previous versions, which could impact downstream tasks sensitive to precise chunk content.","severity":"gotcha","affected_versions":">=3.2.0"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}