Chonkie

1.6.2 · active · verified Tue Apr 14

Chonkie is a no-nonsense Python library for text chunking. It offers several strategies, including recursive, semantic, and AI-powered chunkers, and supports advanced features such as HTML table processing and visualization. It is actively maintained, with frequent minor releases and bug fixes; the current version is 1.6.2.

Warnings

Install

Imports

Quickstart

This quickstart uses `RecursiveChunker`, a common and flexible chunking strategy, to break a sample text into smaller pieces. A commented-out example for `TeraflopAIChunker` is also included; it requires an API key and optional dependencies.

import os
from chonkie import RecursiveChunker

# Instantiate a chunker. RecursiveChunker is a common choice.
# chunk_size is the maximum size of each chunk, as measured by the chunker's tokenizer.
chunker = RecursiveChunker(chunk_size=500)

text = (
    "Chonkie is a highly efficient and flexible text chunking library in Python. "
    "It provides various strategies for breaking down long documents into smaller, "
    "manageable chunks, which is crucial for many NLP applications like RAG. "
    "The library supports different chunking methods, including recursive, semantic, "
    "and AI-driven approaches, and can handle various input formats like raw text and HTML. "
    "Recent versions have introduced features like HTML table support and CLI tools." 
)

# Chunk the text
chunks = chunker.chunk(text)

print(f"Original text length: {len(text)} characters")
print(f"Number of chunks: {len(chunks)}")
for i, chunk in enumerate(chunks):
    # Each result is a Chunk object; the chunk's text is available as chunk.text
    print(f"Chunk {i+1} (length {len(chunk.text)}): {chunk.text[:100]}...")

# Example with TeraflopAIChunker (requires API key and 'llm' extra)
# from chonkie import TeraflopAIChunker
# teraflop_api_key = os.environ.get('TERAFLOPAI_API_KEY', 'YOUR_TERAFLOPAI_API_KEY')
# if teraflop_api_key != 'YOUR_TERAFLOPAI_API_KEY':
#     try:
#         ai_chunker = TeraflopAIChunker(api_key=teraflop_api_key)
#         ai_chunks = ai_chunker.chunk(text)
#         print(f"\nAI Chunker chunks: {len(ai_chunks)}")
#     except Exception as e:
#         print(f"Could not use TeraflopAIChunker: {e}")
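
To make the idea behind recursive chunking concrete, here is a minimal pure-Python sketch of the general technique: split on the coarsest separator first, and only fall back to finer separators for pieces that are still too large. This is not Chonkie's implementation. The function name `recursive_split`, the separator list, and the character-based size limit are all assumptions chosen for illustration.

```python
from typing import List


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Hypothetical helper: split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Keep each separator attached to the piece it follows, so no text is lost.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Piece is too large on its own: flush the buffer, then recurse
            # with the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
        elif len(current) + len(piece) <= chunk_size:
            # Piece still fits in the chunk being built.
            current += piece
        else:
            # Piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is a bit longer than the first one. "
) * 3

# Assumed separator hierarchy: paragraphs, then sentences, then words.
pieces = recursive_split(sample, 60, ["\n\n", ". ", " "])
for i, piece in enumerate(pieces, start=1):
    print(f"Piece {i} ({len(piece)} chars): {piece!r}")
```

Because separators stay attached to the text they follow, joining the pieces reproduces the input exactly. A real library may measure size in tokens rather than characters and may use different separator rules.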
