Chonkie
Chonkie is a no-nonsense Python library for text chunking. It offers several strategies, including recursive, semantic, and AI-powered chunkers, and supports advanced features such as HTML table processing and visualization. The library is actively maintained, with frequent minor releases and bug fixes; the current version is 1.6.2.
Warnings
- breaking: Chonkie v1.5.0 dropped support for Python 3.9. Upgrade to Python 3.10 or newer to use Chonkie v1.5.0 or later.
- gotcha: Many advanced chunkers (e.g., OpenAI, TeraflopAI, LangChain-based) and the visualization tools depend on optional packages that `pip install chonkie` does not install. Using these features without the matching dependencies raises `ModuleNotFoundError`.
- gotcha: Before v1.5.5, `import chonkie` could fail with a `ModuleNotFoundError` when `openai` was not installed, even if you never used the OpenAI-specific features, because the imports were not lazy.
- gotcha: The `TeraflopAIChunker` (introduced in v1.6.2) requires an API key for its service. Without a valid key, initialization or chunking fails.
- gotcha: Chonkie migrated its performance-critical components from Cython to Rust in v1.5.4. The change is largely internal, but it may affect build environments and performance characteristics for users compiling from source.
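The Python version and optional-dependency warnings above can be checked before any chunker is created. The following is a minimal sketch and not part of Chonkie itself: the helper names `python_is_supported` and `missing_modules` are hypothetical, and the module list (here only `openai`, the one package named above) is an assumption to adapt to the extras you actually use.

```python
import importlib.util
import sys

# Minimum interpreter version stated above for Chonkie v1.5.0 or later.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info):
    """Return True if the (major, minor) version meets the 3.10+ requirement."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def missing_modules(module_names):
    """Return the top-level module names that cannot be found, without importing them."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if not python_is_supported():
    print("Python 3.10 or newer is needed for Chonkie v1.5.0 or later")

# Assumption: "openai" stands in for whichever optional packages your chunkers need.
absent = missing_modules(["openai"])
if absent:
    print(f"Optional modules not installed: {', '.join(absent)}")
```

Because `find_spec` only locates a module and does not import it, the check avoids triggering the import errors it is meant to detect.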
Install
- pip install chonkie
- pip install 'chonkie[all]'
- pip install 'chonkie[llm]'
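Since several of the warnings above depend on the release, it can help to confirm which version was actually installed. This sketch uses the standard library's `importlib.metadata`; the helpers `parse_version` and `installed_version` are hypothetical names, and the parsing is deliberately simplistic (it reads only the leading digits of the first three dot-separated parts), so treat it as an illustration and not a full version parser.

```python
import re
from importlib import metadata


def parse_version(version):
    """Convert a simple 'X.Y.Z' string into a 3-tuple of ints for comparison."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    # Pad short versions such as '1.5' to three components.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def installed_version(package="chonkie"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("chonkie is not installed")
elif parse_version(version) < (1, 5, 5):
    # Per the warning above, releases before v1.5.5 may need `openai` at import time.
    print(f"chonkie {version}: importing may require the openai package")
else:
    print(f"chonkie {version} is installed")
```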
Imports
- RecursiveChunker
from chonkie import RecursiveChunker
- TeraflopAIChunker
from chonkie import TeraflopAIChunker
- Visualizer
from chonkie import Visualizer
- FastChunker
from chonkie import FastChunker
- LateChunker
from chonkie import LateChunker
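Some of these classes may be unavailable depending on the installed release or extras, so an import can be guarded. The sketch below is one possible pattern, assuming only that the names listed above are exported from the top-level `chonkie` package; `load_chunker` is a hypothetical helper, not a Chonkie function.

```python
import importlib


def load_chunker(class_name, package="chonkie"):
    """Return the named attribute from the package, or None if it cannot be loaded."""
    try:
        module = importlib.import_module(package)
        return getattr(module, class_name)
    except (ImportError, AttributeError):
        # ImportError covers a missing package or missing optional dependency;
        # AttributeError covers a release that does not export the name.
        return None


# Prefer the newer chunker when present, otherwise fall back to RecursiveChunker.
chunker_cls = load_chunker("TeraflopAIChunker") or load_chunker("RecursiveChunker")
if chunker_cls is None:
    print("No chunker could be imported from chonkie in this environment")
else:
    print(f"Using {chunker_cls.__name__}")
```

A silent fallback suits exploratory scripts; in production code it is usually better to fail loudly so a missing extra is noticed.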
Quickstart
import os  # used by the optional TeraflopAIChunker example below

from chonkie import RecursiveChunker

# Instantiate a chunker. RecursiveChunker is a common choice.
# chunk_size caps the size of each chunk.
chunker = RecursiveChunker(chunk_size=500)

text = (
    "Chonkie is a highly efficient and flexible text chunking library in Python. "
    "It provides various strategies for breaking down long documents into smaller, "
    "manageable chunks, which is crucial for many NLP applications like RAG. "
    "The library supports different chunking methods, including recursive, semantic, "
    "and AI-driven approaches, and can handle various input formats like raw text and HTML. "
    "Recent versions have introduced features like HTML table support and CLI tools."
)

# Chunk the text
chunks = chunker.chunk(text)

print(f"Original text length: {len(text)} characters")
print(f"Number of chunks: {len(chunks)}")

# Each chunk is an object; its content is in the .text attribute.
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} (length {len(chunk.text)}): {chunk.text[:100]}...")

# Example with TeraflopAIChunker (requires API key and 'llm' extra)
# from chonkie import TeraflopAIChunker
# teraflop_api_key = os.environ.get('TERAFLOPAI_API_KEY', 'YOUR_TERAFLOPAI_API_KEY')
# if teraflop_api_key != 'YOUR_TERAFLOPAI_API_KEY':
#     try:
#         ai_chunker = TeraflopAIChunker(api_key=teraflop_api_key)
#         ai_chunks = ai_chunker.chunk(text)
#         print(f"\nAI Chunker chunks: {len(ai_chunks)}")
#     except Exception as e:
#         print(f"Could not use TeraflopAIChunker: {e}")
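
The commented example compares the key against a placeholder string; an alternative is to fail fast when the key is missing. The following is a small sketch of that idea using only the standard library. `read_api_key` is a hypothetical helper, and the `TERAFLOPAI_API_KEY` variable name is taken from the example above; how the key is then passed to the chunker is left as shown there.

```python
import os


def read_api_key(env_var="TERAFLOPAI_API_KEY", environ=None):
    """Return the API key from the environment, or raise if it is unset or blank."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before using the chunker")
    return key


try:
    api_key = read_api_key()
except RuntimeError as exc:
    # No key configured: skip the AI chunker instead of failing mid-run.
    api_key = None
    print(exc)
```

Passing a dictionary as `environ` makes the helper easy to exercise in tests without touching the real environment.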