Chonkie Core

0.10.1 · active · verified Fri Apr 17

Chonkie Core is a high-performance Python library for semantic text chunking, backed by a Rust implementation for speed. It offers several splitting strategies, including delimiter-based, size-constrained, and Savitzky-Golay-filter-based semantic splitting. The library is actively developed, with frequent releases adding new features and optimizations.
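To illustrate the delimiter-based strategy, here is a minimal pure-Python sketch of the idea: split after any delimiter character, keeping the delimiter attached to the preceding chunk. This is an illustration only, not the library's Rust implementation, and `naive_chunk` is a hypothetical helper name.

```python
import re

def naive_chunk(text, delimiters=".?!"):
    # Split after any delimiter character, keeping the delimiter
    # attached to the chunk that precedes it. Illustrative sketch only;
    # the library's actual implementation is in Rust and more capable.
    pattern = "([" + re.escape(delimiters) + "])"
    parts = re.split(pattern, text)
    chunks = []
    # re.split with a capturing group alternates text and delimiter pieces.
    for i in range(0, len(parts) - 1, 2):
        piece = (parts[i] + parts[i + 1]).strip()
        if piece:
            chunks.append(piece)
    tail = parts[-1].strip()
    if tail:  # text after the final delimiter, if any
        chunks.append(tail)
    return chunks

print(naive_chunk("One. Two! Three?"))  # → ['One.', 'Two!', 'Three?']
```

Because Python strings are Unicode, the same approach works unchanged with multi-byte delimiters such as "。".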

Quickstart

Demonstrates basic text chunking: the Chunker class for customizable splitting, the `chunk` convenience function, and `chunk_offsets` for retrieving character offsets. The examples use both ASCII and multi-byte delimiters and patterns.

from chonkie_core import Chunker, chunk, chunk_offsets

text = "This is the first sentence. This is the second sentence! And this is the third sentence, with a comma. Finally, the last one. Here is some Japanese: これは日本語のテキストです。句読点も含まれます。"

# Using Chunker class with delimiters and patterns
print("--- Using Chunker ---")
chunks_obj = list(Chunker(text, delimiters="\n.?!", patterns=["。", ",", "!"]))
for c in chunks_obj:
    print(f"'{c.text}' (len: {len(c.text)})\nOffset range: {c.offset_range}")

# Using convenience function `chunk`
print("\n--- Using chunk function ---")
for c in chunk(text, delimiters=".", patterns=["。"]): # The 'chunk' function returns Chunk objects
    print(f"'{c.text}' (len: {len(c.text)})\nOffset range: {c.offset_range}")

# Getting offsets directly
print("\n--- Using chunk_offsets function ---")
offsets = chunk_offsets(text, delimiters=".", patterns=["。"])
print(f"Offsets: {offsets}")

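The offsets returned by `chunk_offsets` can be sketched in pure Python as well. The useful invariant is that slicing the original text by the returned (start, end) pairs reconstructs it exactly. This is an illustrative sketch with a hypothetical helper name, not the library's own code.

```python
def naive_chunk_offsets(text, delimiters="."):
    # Return (start, end) character offsets instead of chunk strings;
    # a sketch of the idea behind chunk_offsets, not the library's code.
    offsets, start = [], 0
    for i, ch in enumerate(text):
        if ch in delimiters:
            offsets.append((start, i + 1))  # include the delimiter
            start = i + 1
    if start < len(text):  # trailing text after the last delimiter
        offsets.append((start, len(text)))
    return offsets

text = "One. Two. Three."
spans = naive_chunk_offsets(text)
print(spans)  # → [(0, 4), (4, 9), (9, 16)]

# Slicing by the offsets reconstructs the original text exactly.
assert "".join(text[s:e] for s, e in spans) == text
```

Offsets are useful when the chunk text itself is not needed, e.g. for building an index over the original document without copying substrings.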