{"id":9875,"library":"kreuzberg","title":"Kreuzberg Document Intelligence","description":"Kreuzberg is a high-performance Python library for document intelligence, enabling extraction of text, metadata, and structured data from PDFs, Office documents, images, and over 88 other formats. It leverages a Rust core for significant speed improvements (10-50x faster) compared to pure Python alternatives. The current version is 4.8.5, with an active release cadence, typically releasing minor updates every few weeks.","status":"active","version":"4.8.5","language":"en","source_language":"en","source_url":"https://github.com/kreuzberg-dev/kreuzberg","tags":["document-intelligence","pdf","ocr","llm","extraction","rust-powered","data-extraction"],"install":[{"cmd":"pip install kreuzberg","lang":"bash","label":"Install latest version"}],"dependencies":[],"imports":[{"symbol":"extract","correct":"from kreuzberg import extract"},{"symbol":"ExtractionConfig","correct":"from kreuzberg import ExtractionConfig"},{"symbol":"OutputFormat","correct":"from kreuzberg import OutputFormat"},{"symbol":"HtmlOutputConfig","correct":"from kreuzberg import HtmlOutputConfig"}],"quickstart":{"code":"import os\nfrom kreuzberg import extract, ExtractionConfig, OutputFormat, HtmlOutputConfig\n\n# Create a dummy file for demonstration\nwith open(\"example.txt\", \"w\") as f:\n    f.write(\"This is a test document for Kreuzberg extraction.\")\n\n# Example 1: Basic text extraction\nconfig_text = ExtractionConfig(\n    output_format=OutputFormat.TEXT\n)\nresult_text = extract(\"example.txt\", config=config_text)\nprint(\"--- Text Extraction ---\")\nprint(result_text.text)\n\n# Example 2: HTML extraction with a specific theme\nconfig_html = ExtractionConfig(\n    output_format=OutputFormat.HTML,\n    html_output=HtmlOutputConfig(theme=\"github\")\n)\nresult_html = extract(\"example.txt\", config=config_html)\nprint(\"\\n--- HTML Extraction (GitHub theme) 
---\")\nprint(result_html.html)\n\nos.remove(\"example.txt\")  # Clean up the dummy file","lang":"python","description":"This quickstart demonstrates how to perform basic text and styled HTML extraction using Kreuzberg's `extract` function with `ExtractionConfig` and `OutputFormat`. It shows how to specify the output format and customize HTML output with `HtmlOutputConfig`."},"warnings":[{"fix":"Upgrade your Python environment to 3.10 or a more recent version (e.g., 3.11, 3.12).","message":"Kreuzberg requires Python 3.10 or newer. Installing or running the library on an older Python version fails.","severity":"breaking","affected_versions":">=4.0.0 (earlier major versions might have supported older Python versions)"},{"fix":"To get unstyled HTML, explicitly set `html_output=HtmlOutputConfig(theme=\"unstyled\")` in your `ExtractionConfig`.","message":"Versions 4.8.1 and later apply default styling to HTML output. If you were expecting plain, unstyled HTML, your output will now include CSS and semantic classes.","severity":"gotcha","affected_versions":">=4.8.1"},{"fix":"Upgrade to `kreuzberg v4.8.2` or newer to resolve the issue with overly aggressive content stripping.","message":"Versions prior to 4.8.2 had a bug where legitimate repeated content (e.g., brand names, headers) in PDFs could be stripped, even if `strip_repeating_text` was not enabled or intended.","severity":"gotcha","affected_versions":"<4.8.2"},{"fix":"Upgrade `kreuzberg` to `v4.7.3` or a newer version to fix the archive extraction crash.","message":"Users on macOS ARM64 systems (e.g., M1/M2/M3 Macs) using `kreuzberg` versions older than `v4.7.3` might experience a `SIGBUS` (Bus error: 10) crash when processing archive files (ZIP, 7Z, TAR, GZIP).","severity":"gotcha","affected_versions":"<4.7.3"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Run `pip install kreuzberg` to 
install the library.","cause":"The Kreuzberg library is not installed in your current Python environment.","error":"ModuleNotFoundError: No module named 'kreuzberg'"},{"fix":"Ensure you are using the latest `kreuzberg` version and refer to the official documentation or the quickstart code for correct `extract` function usage. The current API expects `extract(file_path, config=...)`.","cause":"This error typically indicates a version mismatch: the `extract` function's signature has changed, and you are running an older example against a newer library version (or vice versa). The quickstart passes the configuration as a keyword argument, e.g. `config=config_text`.","error":"TypeError: extract() got an unexpected keyword argument 'config'"},{"fix":"Upgrade your Python installation to 3.10 or higher, or activate a virtual environment that uses a supported Python version.","cause":"You are attempting to install or run Kreuzberg on a Python version older than 3.10, which is not supported.","error":"ERROR: Package 'kreuzberg' requires a different Python: 3.9.x not in '>=3.10'"},{"fix":"To explicitly get unstyled HTML, set `html_output=HtmlOutputConfig(theme=\"unstyled\")` in your `ExtractionConfig`.","cause":"Starting with `v4.8.1`, HTML output gained default styling via `HtmlOutputConfig`. If you didn't specify `html_output`, it now defaults to a styled theme.","error":"My HTML output from Kreuzberg is suddenly styled with CSS, but I wanted plain HTML."}]}