semchunk
semchunk is a Python library for splitting text into smaller chunks while preserving as much local semantic context as possible. It supports advanced features such as AI-powered hierarchical chunking, chunk overlapping, and processing of Isaacus Legal Graph Schema (ILGS) Documents, and it works with a wide range of tokenizers. Actively developed by Isaacus, the library is released frequently; version 4.0.0 notably introduced AI chunking and ILGS Document support.
Warnings
- gotcha When specifying `chunk_size`, be aware that `semchunk` does not automatically account for special tokens added by your tokenizer. You should typically deduct the number of special tokens from your desired `chunk_size` to ensure chunks do not exceed the model's actual context window. This critical guidance was removed in v3.0.0 but re-added in v3.1.1 due to its importance for correct usage.
- breaking As of version 4.0.0, all arguments to `semchunk.chunkerify()` except the first two, and all arguments to the callable `chunker` it returns except the first three, are keyword-only. Passing any of these arguments positionally raises a `TypeError`.
- breaking In version 4.0.0, `semchunk` changed its default behavior for handling special tokens when using `tiktoken` or `transformers` tokenizers. It now treats special tokens as normal text. Previously, `tiktoken` would raise an error, and `transformers` would treat them as special tokens. This can alter token counts and chunking behavior for texts containing special tokens.
- gotcha Version 3.2.0 introduced a significant improvement in chunk quality, particularly for low chunk sizes or documents with minimal whitespace, by prioritizing more semantically meaningful split points. Version 3.2.4 also fixed the splitter sorting order. While an improvement, these changes mean the exact chunk boundaries may differ from previous versions, which could impact downstream tasks sensitive to precise chunk content.
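The special-token deduction described above is simple arithmetic. A minimal sketch, assuming a model with a 512-token context window whose tokenizer adds two special tokens per input (e.g. BERT-style `[CLS]`/`[SEP]`; both numbers here are assumptions, not values read from any tokenizer):

```python
# Deduct tokenizer-added special tokens from the desired chunk size so that
# each chunk plus its special tokens still fits the model's context window.
model_context_window = 512  # assumed model limit
num_special_tokens = 2      # assumed: e.g. [CLS] and [SEP] for BERT-style models

chunk_size = model_context_window - num_special_tokens
print(chunk_size)  # 510
```

With a `transformers` tokenizer, the per-input overhead can often be read from the tokenizer itself (e.g. `tokenizer.num_special_tokens_to_add()`) rather than hard-coded.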
Install
- pip install semchunk
Imports
- chunkerify
import semchunk

chunker = semchunk.chunkerify(...)
Quickstart
import semchunk
# You can optionally import transformers or tiktoken for specific tokenizers,
# but they are not direct dependencies of semchunk itself.
# from transformers import AutoTokenizer
# import tiktoken
chunk_size = 4 # A low chunk size is used here for demonstration purposes.
# Keep in mind, `semchunk` does not know how many special tokens, if any,
# your tokenizer adds to every input, so you may want to deduct the number
# of special tokens added from your chunk size.
text = 'The quick brown fox jumps over the lazy dog.'
# `chunkerify` accepts the name of an OpenAI model, Tiktoken encoding, Hugging Face model,
# or a custom tokenizer/token counter.
chunker = semchunk.chunkerify('gpt-4', chunk_size) # Using an OpenAI model name
# Example with a Hugging Face tokenizer (requires `transformers` to be installed):
# from transformers import AutoTokenizer
# chunker = semchunk.chunkerify(AutoTokenizer.from_pretrained('bert-base-uncased'), chunk_size)
chunks = chunker(text)
print(chunks)
# Expected output might vary slightly based on the tokenizer and chunk_size,
# but will be similar to: ['The quick brown', 'fox jumps over', 'the lazy dog.']
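As noted above, `chunkerify` also accepts a custom token counter: any callable mapping a string to a token count. A minimal sketch using a naive whitespace counter (an assumption adequate for demos only, since it will not match a real model's tokenization):

```python
# A token counter is any callable that maps a string to a token count.
def word_counter(text: str) -> int:
    return len(text.split())

# It can then be passed to chunkerify in place of a tokenizer
# (assumes semchunk is installed):
# chunker = semchunk.chunkerify(word_counter, 4)

print(word_counter('The quick brown fox jumps over the lazy dog.'))  # 9
```

Because the counter alone decides chunk boundaries, swapping it out (e.g. for a real tokenizer's `encode`-based count) changes where chunks are split.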