Kreuzberg Document Intelligence
Kreuzberg is a high-performance Python library for document intelligence. It extracts text, metadata, and structured data from PDFs, Office documents, images, and over 88 other formats. Its Rust core makes it 10-50x faster than pure-Python alternatives. The current version is 4.8.5, and development is active: minor updates typically arrive every few weeks.
Common errors
-
ModuleNotFoundError: No module named 'kreuzberg'
cause The Kreuzberg library is not installed in your current Python environment.
fix Run `pip install kreuzberg` to install the library.
-
TypeError: extract() got an unexpected keyword argument 'config'
cause This error typically indicates a version mismatch: either the signature of `extract` has changed, or you are running an older example against a newer library version (or a newer example against an older one). The quickstart passes the configuration as a keyword argument, for example `config=config_text`.
fix Upgrade to the latest `kreuzberg` version and check the official documentation or the quickstart code for the correct `extract` usage. The current API expects `extract(file_path, config=...)`.
-
ERROR: Package 'kreuzberg' requires a different Python: 3.9.x not in '>=3.10'
cause You are attempting to install or run Kreuzberg on a Python version older than 3.10, which is not supported.
fix Upgrade your Python installation to 3.10 or higher, or activate a virtual environment that uses a supported Python version.
-
My HTML output from Kreuzberg is suddenly styled with CSS, but I wanted plain HTML.
cause Starting with `v4.8.1`, HTML output gained default styling via `HtmlOutputConfig`. If you didn't specify `html_output`, it now defaults to a styled theme.
fix To explicitly get unstyled HTML, set `html_output=HtmlOutputConfig(theme="unstyled")` in your `ExtractionConfig`.
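The first three errors above all trace back to the interpreter version or to what is installed in the active environment. The following is a minimal diagnostic sketch that uses only the Python standard library. The helper name `diagnose_environment` is hypothetical and is not part of Kreuzberg; the 3.10 minimum is the one quoted in the install error above.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)  # minimum Python version stated in the install error


def diagnose_environment(python_version=None, package="kreuzberg"):
    """Return a list of human-readable problems (empty if none were found).

    Hypothetical helper for illustration; not part of the Kreuzberg API.
    `python_version` defaults to the running interpreter's (major, minor).
    """
    problems = []
    version = tuple(python_version or sys.version_info[:2])
    if version < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} is older than the required "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}"
        )
    try:
        # Looks up the installed distribution's version without importing it.
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        problems.append(f"{package} is not installed; run `pip install {package}`")
    else:
        print(f"{package} {installed} is installed")
    return problems
```

Calling `diagnose_environment()` with no arguments checks the running interpreter and the `kreuzberg` distribution. If it reports an installed version and you still see the `TypeError`, compare that version against the one your example code was written for.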
Warnings
- breaking Kreuzberg requires Python 3.10 or newer. Installing or running the library on older Python versions will result in errors.
- gotcha When extracting in HTML format, versions 4.8.1 and later introduced default styling. If you were expecting plain, unstyled HTML, your output will now include CSS and semantic classes.
- gotcha Versions prior to 4.8.2 had a bug where legitimate repeated content (e.g., brand names, headers) in PDFs could be stripped, even if `strip_repeating_text` was not enabled or intended.
- gotcha Users on macOS ARM64 systems (e.g., M1/M2/M3 Macs) using `kreuzberg` versions older than `v4.7.3` might experience a `SIGBUS` (Bus error: 10) crash when processing archive files (ZIP, 7Z, TAR, GZIP).
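Three of the warnings above depend on which version is installed. As a rough sketch, the version thresholds can be encoded in a small lookup. The function names `parse_version` and `applicable_gotchas` are hypothetical, the parser only handles plain `X.Y.Z` strings (with an optional leading `v`), and the thresholds are copied from the warnings above rather than from a changelog.

```python
def parse_version(text):
    """Parse '4.8.1' or 'v4.8.1' into a tuple of ints (plain X.Y.Z only)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def applicable_gotchas(version_text):
    """Return the version-dependent warnings that apply to a given version.

    Hypothetical helper; thresholds mirror the warnings listed above.
    """
    version = parse_version(version_text)
    notes = []
    if version >= (4, 8, 1):
        notes.append("HTML output is styled by default")
    if version < (4, 8, 2):
        notes.append("repeated PDF content may be stripped")
    if version < (4, 7, 3):
        notes.append("archive extraction may crash with SIGBUS on macOS ARM64")
    return notes
```

To check your own environment, pass in the string returned by `importlib.metadata.version("kreuzberg")`. Pre-release or otherwise non-numeric version strings would need a more complete parser than this one.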
Install
-
pip install kreuzberg
Imports
- extract
from kreuzberg import extract
- ExtractionConfig
from kreuzberg import ExtractionConfig
- OutputFormat
from kreuzberg import OutputFormat
- HtmlOutputConfig
from kreuzberg import HtmlOutputConfig
Quickstart
import os
from kreuzberg import extract, ExtractionConfig, OutputFormat, HtmlOutputConfig

# Create a dummy file for demonstration
with open("example.txt", "w") as f:
    f.write("This is a test document for Kreuzberg extraction.")

# Example 1: Basic text extraction
config_text = ExtractionConfig(
    output_format=OutputFormat.TEXT
)
result_text = extract("example.txt", config=config_text)
print("--- Text Extraction ---")
print(result_text.text)

# Example 2: HTML extraction with a specific theme
config_html = ExtractionConfig(
    output_format=OutputFormat.HTML,
    html_output=HtmlOutputConfig(theme="github")
)
result_html = extract("example.txt", config=config_html)
print("\n--- HTML Extraction (GitHub theme) ---")
print(result_html.html)

os.remove("example.txt")  # Clean up the dummy file