semchunk

4.0.0 · active · verified Sat Apr 11

semchunk is a Python library for splitting text into smaller chunks while preserving as much local semantic context as possible. It supports chunk overlapping, AI-powered hierarchical chunking, and processing of Isaacus Legal Graph Schema (ILGS) Documents, and it works with a wide range of tokenizers and token counters. Actively developed by Isaacus, the library sees frequent releases; version 4.0.0 notably introduced AI chunking and ILGS Document support.
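To make the idea concrete, here is a minimal, illustrative sketch of separator-based semantic chunking: recursively split on progressively finer separators (paragraphs, lines, sentences, words), then greedily re-merge adjacent pieces up to the token budget. This is a toy re-implementation of the general technique, not semchunk's actual algorithm.

```python
def simple_chunk(text, max_tokens, count_tokens, seps=('\n\n', '\n', '. ', ' ')):
    """Toy separator-based chunker (illustration only, not semchunk's code)."""
    if count_tokens(text) <= max_tokens:
        return [text]
    # Split on the coarsest separator actually present in the text.
    sep = next((s for s in seps if s in text), None)
    if sep is None:
        return [text]  # nothing finer to split on; emit as-is
    pieces = []
    for part in (p for p in text.split(sep) if p):
        pieces.extend(simple_chunk(part, max_tokens, count_tokens, seps))
    if not pieces:
        return []
    # Greedily merge adjacent pieces back together while they fit the budget.
    merged = [pieces[0]]
    for piece in pieces[1:]:
        candidate = merged[-1] + sep + piece
        if count_tokens(candidate) <= max_tokens:
            merged[-1] = candidate
        else:
            merged.append(piece)
    return merged

# Word-count "tokenizer" for demonstration.
chunks = simple_chunk('The quick brown fox jumps over the lazy dog.', 3,
                      lambda t: len(t.split()))
print(chunks)  # ['The quick brown', 'fox jumps over', 'the lazy dog.']
```

The merge step is what keeps chunks close to the budget instead of degenerating into single-word fragments; real implementations refine this with smarter separator ranking and memoized token counting.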

Warnings

Install
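semchunk is distributed on PyPI, so a standard pip install should suffice (the package name is assumed to match the import name):

```shell
pip install semchunk
```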

Imports

Quickstart

Demonstrates how to initialize a `semchunk` chunker with a specified tokenizer (e.g., by model name) and a maximum `chunk_size`, then use the returned callable to split text into semantically meaningful segments. This example highlights the use of model names for easy tokenizer integration.

import semchunk

# You can optionally import transformers or tiktoken for specific tokenizers,
# but they are not direct dependencies of semchunk itself.
# from transformers import AutoTokenizer
# import tiktoken

chunk_size = 4 # A low chunk size is used here for demonstration purposes.
# Keep in mind, `semchunk` does not know how many special tokens, if any,
# your tokenizer adds to every input, so you may want to deduct the number
# of special tokens added from your chunk size.

text = 'The quick brown fox jumps over the lazy dog.'

# `chunkerify` accepts the name of an OpenAI model, Tiktoken encoding, Hugging Face model,
# or a custom tokenizer/token counter.
chunker = semchunk.chunkerify('gpt-4', chunk_size) # Using an OpenAI model name
# Example with a Hugging Face tokenizer (requires `transformers` to be installed):
# from transformers import AutoTokenizer
# chunker = semchunk.chunkerify(AutoTokenizer.from_pretrained('bert-base-uncased'), chunk_size)

chunks = chunker(text)
print(chunks)
# Expected output might vary slightly based on the tokenizer and chunk_size, 
# but will be similar to: ['The quick brown', 'fox jumps over', 'the lazy dog.']
