Chunkr AI Python Client
Chunkr AI provides a Python client for its open-source document intelligence platform, which exposes API services for document layout analysis, OCR, and semantic chunking. It converts complex documents such as PDFs, PowerPoint decks, Word files, and images into structured data ready for RAG and LLM pipelines. The current version is 0.3.7, and the project is under active development, with regular releases and blog posts on new features and models.
Warnings
- gotcha The Python SDK is currently in alpha and must be installed with the `--pre` flag, so the API may change before a stable release.
- breaking There are two distinct versions: an open-source AGPL self-hosted version and a fully managed Cloud API. They use different underlying models (community/open-source vs. proprietary in-house), leading to differences in accuracy, speed, and available features (e.g., Excel support is Cloud API exclusive).
- gotcha An API key is required to authenticate with the Chunkr AI Cloud API. Requests without a valid key fail with authentication errors.
- gotcha Suboptimal chunking strategies can lead to increased AI costs, reduced retrieval accuracy, and inconsistent LLM responses. While Chunkr aims for intelligent chunking, users should be aware of how different strategies impact their RAG systems.
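To make the chunking trade-off above concrete, here is a minimal sketch that is independent of the Chunkr SDK and does not reproduce its semantic chunking. The `chunk_words` and `total_words` helpers, and the `target_length` and `overlap` parameters, are hypothetical names chosen for illustration. The sketch only shows that a smaller target length produces more chunks to embed and retrieve, while overlap increases the total amount of text stored.

```python
def chunk_words(text: str, target_length: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most `target_length` words.

    Consecutive chunks share `overlap` words. Hypothetical helper for
    illustration only; it is not part of the Chunkr SDK.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if not 0 <= overlap < target_length:
        raise ValueError("overlap must be in [0, target_length)")

    words = text.split()
    step = target_length - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + target_length]))
        # Stop once the current chunk has reached the end of the text.
        if start + target_length >= len(words):
            break
    return chunks


def total_words(chunks: list[str]) -> int:
    """Count the words stored across all chunks (overlap is counted twice)."""
    return sum(len(chunk.split()) for chunk in chunks)


# A synthetic 100-word document.
sample = " ".join(f"w{i}" for i in range(100))

small = chunk_words(sample, target_length=10)
large = chunk_words(sample, target_length=50)
overlapped = chunk_words(sample, target_length=50, overlap=10)

print(len(small), len(large), len(overlapped))
print(total_words(large), total_words(overlapped))
```

Word counts stand in for tokens here; a real pipeline would measure length with the tokenizer of the target embedding or LLM model.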
Install
-
pip install chunkr-ai --pre
Imports
- Chunkr
from chunkr_ai import Chunkr
- ChunkProcessing
from chunkr_ai.models import ChunkProcessing
- Configuration
from chunkr_ai.models import Configuration
- Tokenizer
from chunkr_ai.models import Tokenizer
Quickstart
import os
from chunkr_ai import Chunkr
from chunkr_ai.models import ChunkProcessing, Configuration, Tokenizer

# Ensure your Chunkr API key is set as an environment variable CHUNKR_API_KEY
api_key = os.environ.get('CHUNKR_API_KEY', '')
if not api_key:
    print("Warning: CHUNKR_API_KEY environment variable not set. The API call will likely fail.")

chunkr = Chunkr(api_key=api_key)

# Example of processing a document (replace with your document URL or file path)
# This example uses default chunking strategies.
try:
    task = chunkr.parse_document(file_url="https://example.com/document.pdf")
    print(f"Document processing task submitted with ID: {task.task_id}")
    # You can poll for the task status or set up webhooks
    # For a simple quickstart, we'll just acknowledge submission.
    print("Check Chunkr AI dashboard or use get_task_output for results.")
except Exception as e:
    print(f"An error occurred: {e}")
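The polling mentioned in the quickstart comments can be sketched generically. The helper below is an assumption for illustration and is not part of the Chunkr SDK: `wait_for_status` is a hypothetical function, `fetch_status` is a hypothetical callable that you would wrap around whatever status lookup your SDK version provides, and the status strings are placeholders.

```python
import time
from typing import Callable


def wait_for_status(
    fetch_status: Callable[[], str],
    done_statuses: frozenset[str] = frozenset({"Succeeded", "Failed"}),
    interval: float = 1.0,
    max_interval: float = 30.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll `fetch_status` with exponential backoff until a terminal status.

    Hypothetical helper: the status names are placeholders, and
    `fetch_status` stands in for a real task-status lookup.
    """
    deadline = time.monotonic() + timeout
    delay = interval
    while True:
        status = fetch_status()
        if status in done_statuses:
            return status
        # Give up if the next wait would run past the deadline.
        if time.monotonic() + delay > deadline:
            raise TimeoutError(f"task still '{status}' after {timeout} seconds")
        sleep(delay)
        # Double the wait each round, capped at max_interval.
        delay = min(delay * 2, max_interval)


# Usage with a fake status source, so the sketch runs without network access.
fake_statuses = iter(["Starting", "Processing", "Succeeded"])
waits: list[float] = []
result = wait_for_status(lambda: next(fake_statuses), sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper testable without real delays; in production the default `time.sleep` applies. For long-running documents, webhooks avoid polling altogether.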