Kreuzberg Document Intelligence

4.8.5 · active · verified Fri Apr 17

Kreuzberg is a high-performance Python library for document intelligence. It extracts text, metadata, and structured data from PDFs, Office documents, images, and more than 88 other formats. Its Rust core makes it 10-50x faster than pure-Python alternatives. The current version is 4.8.5, and the project is actively maintained, with minor updates typically released every few weeks.

Quickstart

This quickstart extracts plain text and styled HTML from a file using Kreuzberg's `extract` function. `ExtractionConfig` selects the output format through `OutputFormat`, and `HtmlOutputConfig` customizes the HTML output.

import os
from kreuzberg import extract, ExtractionConfig, OutputFormat, HtmlOutputConfig

# Create a dummy file for demonstration
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("This is a test document for Kreuzberg extraction.")

try:
    # Example 1: Basic text extraction
    config_text = ExtractionConfig(
        output_format=OutputFormat.TEXT
    )
    result_text = extract("example.txt", config=config_text)
    print("--- Text Extraction ---")
    print(result_text.text)

    # Example 2: HTML extraction with a specific theme
    config_html = ExtractionConfig(
        output_format=OutputFormat.HTML,
        html_output=HtmlOutputConfig(theme="github")
    )
    result_html = extract("example.txt", config=config_html)
    print("\n--- HTML Extraction (GitHub theme) ---")
    print(result_html.html)
finally:
    # Clean up the dummy file even if extraction raises
    os.remove("example.txt")
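
The same call pattern can be applied to several files at once. The following is a minimal sketch, not part of Kreuzberg's API: `extract_many` is a hypothetical helper that takes the extraction callable as an argument, so that one missing or unreadable file does not abort the rest of the batch. It makes no assumptions about what the callable returns.

```python
from pathlib import Path
from typing import Any, Callable, Iterable


def extract_many(
    paths: Iterable[str],
    extract_fn: Callable[[str], Any],
) -> tuple[dict[str, Any], dict[str, Exception]]:
    """Run extract_fn on each path, collecting results and errors separately.

    extract_many is a hypothetical helper, not a Kreuzberg function.
    """
    results: dict[str, Any] = {}
    errors: dict[str, Exception] = {}
    for path in paths:
        # Record missing files up front instead of calling the extractor.
        if not Path(path).is_file():
            errors[path] = FileNotFoundError(path)
            continue
        try:
            results[path] = extract_fn(path)
        except Exception as exc:
            # Keep going and report the failure for this file only.
            errors[path] = exc
    return results, errors
```

With Kreuzberg installed, this could be called as `extract_many(["a.pdf", "b.docx"], lambda p: extract(p, config=config_text))`, assuming `extract` accepts a path and a `config` keyword as the quickstart shows; the file names here are placeholders.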
