Chonkie Core
Chonkie Core is a high-performance Python library for semantic text chunking, powered by a Rust backend for speed. It offers various strategies including delimiter-based, size-constrained, and Savitzky-Golay filter-based semantic splitting. The library is actively developed, with frequent releases adding new features and optimizations.
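To illustrate the delimiter-based, size-constrained strategy described above, here is a minimal pure-Python sketch. This is not chonkie-core's Rust implementation, and `split_at_delims` and `max_len` are illustrative names only; it just shows the general idea of closing a chunk at a delimiter or at a size cap.

```python
# Illustrative sketch of delimiter-based splitting with a size cap.
# NOT chonkie-core's implementation -- names and behavior are assumptions.
def split_at_delims(text, delimiters=".?!", max_len=80):
    chunks, start = [], 0
    for i, ch in enumerate(text):
        # Close a chunk at a delimiter, or force a break at the size cap.
        if ch in delimiters or (i - start + 1) >= max_len:
            chunks.append(text[start:i + 1])
            start = i + 1
    if start < len(text):  # trailing text without a final delimiter
        chunks.append(text[start:])
    return chunks

print(split_at_delims("One. Two! Three?"))
```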
Common errors
- ModuleNotFoundError: No module named 'memchunk'
  Cause: The package was renamed from `memchunk` to `chonkie-core` in version `0.5.0`.
  Fix: Replace all `import memchunk` statements with `import chonkie_core` (or `from chonkie_core import ...`), and make sure the package is installed via `pip install chonkie-core`.
- AttributeError: 'Chunker' object has no attribute 'patterns'
  Cause: The `.patterns()` method for multi-byte delimiters was added to the Python bindings in `chonkie-core` version `0.10.1`.
  Fix: Upgrade your `chonkie-core` installation to `0.10.1` or newer: `pip install --upgrade chonkie-core`.
- TypeError: argument 'text': 'int' object cannot be interpreted as a string, expected str
  Cause: An integer (or another non-string, non-bytes type) was passed as the primary text input to a chunking function.
  Fix: Ensure the `text` argument passed to `Chunker` or `chunk` is a string (`str`), e.g. `chunk(str(my_int_var))`.
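The TypeError above can be reproduced and guarded against in plain Python. The sketch below mirrors the error message with a hypothetical `ensure_text` helper; it is not part of chonkie-core's API, just a pattern for validating input before calling `chunk` or `Chunker`.

```python
# Hypothetical guard that mirrors the TypeError shown above:
# accept str, decode bytes, reject everything else.
# `ensure_text` is NOT a chonkie-core function.
def ensure_text(value):
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode("utf-8")
    raise TypeError(
        f"argument 'text': {type(value).__name__!r} object cannot be "
        "interpreted as a string, expected str"
    )

print(ensure_text(b"hello"))  # bytes are decoded to str
print(ensure_text("world"))   # str passes through unchanged
```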
Warnings
- breaking The package and module name was changed from `memchunk` to `chonkie-core` in version `0.5.0`. Existing `import memchunk` statements will fail.
- breaking The `.patterns()` API for multi-byte delimiter support was introduced for Python in `v0.10.1`. Attempting to use it on earlier versions will result in an `AttributeError`.
- gotcha The `chunk` and `Chunker` functions primarily operate on string (`str`) inputs. While they can sometimes handle bytes, unexpected behavior or type errors can occur if non-string/bytes types are passed directly.
- gotcha The Savitzky-Golay filter module, including `savgol_filter` and related functions for semantic chunking, was introduced in `v0.9.0`. These functions use NumPy internally for efficient, zero-copy array operations, so NumPy must be available when they are called.
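To show the filtering idea behind Savitzky-Golay smoothing, here is a pure-Python sketch using the classic 5-point quadratic coefficients (-3, 12, 17, 12, -3)/35. This is only an illustration of the technique; chonkie-core's `savgol_filter` is a separate, NumPy-backed implementation whose signature may differ.

```python
# Sketch of 5-point quadratic Savitzky-Golay smoothing (illustrative only,
# not chonkie-core's `savgol_filter`). Edge points are left unsmoothed.
COEFFS = (-3, 12, 17, 12, -3)  # classic 5-point quadratic/cubic weights
NORM = 35

def savgol5(values):
    out = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(COEFFS, window)) / NORM
    return out

# A polynomial filter of order >= 1 reproduces linear data exactly.
print(savgol5([0, 1, 2, 3, 4, 5, 6]))
```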
Install
- pip install chonkie-core
Imports
Note: prior to `0.5.0`, these names lived in the legacy `memchunk` module (e.g. `from memchunk import Chunker`); that module name no longer works.
- Chunker
from chonkie_core import Chunker
- chunk
from chonkie_core import chunk
- chunk_offsets
from chonkie_core import chunk_offsets
- merge_splits
from chonkie_core import merge_splits
- split_at_delimiters
from chonkie_core import split_at_delimiters
- savgol_filter
from chonkie_core import savgol_filter
Quickstart
from chonkie_core import Chunker, chunk, chunk_offsets
text = "This is the first sentence. This is the second sentence! And this is the third sentence, with a comma. Finally, the last one. Here is some Japanese: これは日本語のテキストです。句読点も含まれます。"
# Using Chunker class with delimiters and patterns
print("--- Using Chunker ---")
chunks_obj = list(Chunker(text, delimiters="\n.?!", patterns=["。", ",", "!"]))
for c in chunks_obj:
    print(f"'{c.text}' (len: {len(c.text)})\nOffset range: {c.offset_range}")
# Using convenience function `chunk`
print("\n--- Using chunk function ---")
for c in chunk(text, delimiters=".", patterns=["。"]):  # the `chunk` function yields Chunk objects
    print(f"'{c.text}' (len: {len(c.text)})\nOffset range: {c.offset_range}")
# Getting offsets directly
print("\n--- Using chunk_offsets function ---")
offsets = chunk_offsets(text, delimiters=".", patterns=["。"])
print(f"Offsets: {offsets}")
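For intuition about what offsets represent, here is a plain-Python sketch assuming each offset is a `(start, end)` pair such that `text[start:end]` reproduces the corresponding chunk. The `offsets_of` helper is hypothetical and mirrors only single-character delimiter splitting, not chonkie-core's full behavior.

```python
# Hypothetical helper showing (start, end) slice-offset semantics;
# not a chonkie-core function.
def offsets_of(text, delimiter="."):
    offsets, start = [], 0
    for i, ch in enumerate(text):
        if ch == delimiter:
            offsets.append((start, i + 1))
            start = i + 1
    if start < len(text):
        offsets.append((start, len(text)))
    return offsets

text = "A. B. C"
for s, e in offsets_of(text):
    print(repr(text[s:e]))  # slicing by each pair recovers the chunk
```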