OpenDataLoader PDF

2.4.3 verified Sat May 09 auth: no python

A Python wrapper for the opendataloader-pdf Java CLI that extracts structured content and metadata from PDFs, including accessibility tags, tables, headings, and strikethrough text. The current version is 2.4.3 and requires Python >=3.10; new releases appear every few months.

pip install opendataloader-pdf
error java.lang.UnsupportedClassVersionError: org/opendataloader/pdf/CLI has been compiled by a more recent version of the Java Runtime (class file version 61.0), this version of the Java Runtime only recognizes class file versions up to 55.0
cause Java version too old (< 17). The CLI requires Java 17+ as of v2.4.0.
fix
Update Java to JDK 17 or later.
error opendataloader_pdf.exceptions.LicenseError: No valid license found. Please set OPENDATALOADER_API_KEY or pass api_key.
cause Missing API key or environment variable.
fix
Set the OPENDATALOADER_API_KEY environment variable or pass api_key to the constructor.
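The missing-key error above can be caught before any extraction work starts. The following is a minimal sketch, not part of the library: `resolve_api_key` is a hypothetical helper name, and only the `OPENDATALOADER_API_KEY` variable name comes from the error message itself.

```python
import os
from typing import Mapping, Optional

# Variable name taken from the LicenseError message above.
ENV_VAR = "OPENDATALOADER_API_KEY"


def resolve_api_key(explicit: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: prefer an explicit key, else read the environment.

    Raises RuntimeError when neither source yields a non-empty key, so the
    problem surfaces before the extractor is constructed.
    """
    env = os.environ if env is None else env
    key = (explicit or env.get(ENV_VAR, "")).strip()
    if not key:
        raise RuntimeError(
            f"No API key found: set {ENV_VAR} or pass api_key explicitly."
        )
    return key
```

The returned string can then be passed as `api_key` to the constructor, which avoids handing it an empty value by accident.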
breaking The `--hybrid-fallback` default changed to `false` in v2.0.1, causing hybrid extraction to fail fast instead of falling back to rule-based extraction.
fix Set `hybrid_fallback=True` explicitly if you want the old fallback behavior.
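gotcha The library needs a Java runtime to launch the CLI: Java 17+ as of v2.4.0 (see the UnsupportedClassVersionError entry above), Java 11+ for earlier releases. If Java is missing or too old, extraction fails with a subprocess error.
fix Install a suitable JDK (17+ for the current 2.4.3 release) and ensure `java` is on PATH.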
deprecated The old `--extract-text` flag is deprecated in v2.3.0 in favor of `--output-format text`.
fix Use `output_format='text'` instead of `extract_text=True`.
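Because the Java problems above only show up as a subprocess error or an UnsupportedClassVersionError at extraction time, a preflight check can be useful. This is a minimal sketch using only the standard library; the function names are illustrative and not part of opendataloader-pdf. The minimum version is a parameter because the entries above mention both Java 11 and Java 17.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Return the major Java version found in `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (for example "1.8.0_292").
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major() -> Optional[int]:
    """Return the major version of the `java` on PATH, or None if it is absent."""
    java = shutil.which("java")
    if java is None:
        return None
    # `java -version` writes its banner to stderr, so inspect both streams.
    proc = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(proc.stderr + proc.stdout)


def java_is_new_enough(required_major: int = 17) -> bool:
    """True when a `java` on PATH reports at least `required_major`."""
    major = installed_java_major()
    return major is not None and major >= required_major
```

Calling `java_is_new_enough()` before constructing the loader lets a script print a clear "install JDK 17 or later" message instead of surfacing the raw Java stack trace.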

Initialize the extractor with an API key and extract content from a PDF file.

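import os

from opendataloader_pdf import OpenDataLoaderPDF

# Read the API key from the environment (variable name per the LicenseError message above).
loader = OpenDataLoaderPDF(api_key=os.environ.get('OPENDATALOADER_API_KEY', ''))
# Open the PDF in binary mode and pass the file object to the extractor.
with open('document.pdf', 'rb') as f:
    result = loader.extract(f)
# Show the first 200 characters of the extracted content.
print(result['content'][:200])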