Docling Python SDK
Docling is a Python SDK and CLI that parses document formats such as PDF, DOCX, and HTML into a unified, structured representation. It understands page layout, tables, and formulas and supports OCR, which simplifies downstream workflows for generative AI applications. The library is actively maintained and frequently updated; the current version is 2.85.0.
Warnings
- breaking Python 3.9 support was dropped in Docling version 2.70.0. Users on Python 3.9 or older must upgrade their Python environment.
- gotcha The `convert()` method returns a `ConversionResult` object, not the `DoclingDocument` itself. The document (`result.document`), the conversion status (`result.status`), and the input details (`result.input`) are attributes of that result object.
- gotcha Multi-threaded applications that use Docling may need careful resource management to avoid thread-safety issues, especially around backend resources such as pypdfium, for which earlier releases included thread-safety fixes.
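One conservative way to act on the thread-safety warning is to serialize calls to `convert()` behind a lock. The sketch below is an illustration under assumptions, not guidance from the Docling project: `SerializedConverter` and `convert_many` are hypothetical names, and the wrapper only assumes that the wrapped object exposes a `convert(source)` method, as `DocumentConverter` does.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SerializedConverter:
    """Hypothetical wrapper that allows only one convert() call at a time."""

    def __init__(self, converter):
        # Any object with a convert(source) method, e.g. a DocumentConverter.
        self._converter = converter
        self._lock = threading.Lock()

    def convert(self, source):
        # Hold the lock for the whole call so that backend resources are
        # never used by two threads at once.
        with self._lock:
            return self._converter.convert(source)


def convert_many(converter, sources, max_workers=4):
    """Convert sources from a thread pool; results come back in input order."""
    safe = SerializedConverter(converter)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe.convert, sources))
```

Serializing conversions trades throughput for safety: the threads can still overlap their own I/O and post-processing, but only one conversion runs at a time. Whether a lock is needed at all depends on the Docling version and backend in use.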
Install
pip install docling
Imports
- DocumentConverter
from docling.document_converter import DocumentConverter
- DoclingDocument
from docling_core.types.doc import DoclingDocument
Quickstart
import os

from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

# Example: convert a document from a URL and export it to Markdown.
# convert() accepts a local file path, a URL, or a DocumentStream wrapping a binary stream.
source_url = os.environ.get('DOCLING_EXAMPLE_URL', 'https://arxiv.org/pdf/2408.09869')

# Initialize the DocumentConverter
converter = DocumentConverter()

try:
    # Convert the document
    result = converter.convert(source_url)

    # Check the conversion status (an enum, not the string 'SUCCESS')
    if result.status == ConversionStatus.SUCCESS:
        # Access the structured document and export it to Markdown
        markdown_output = result.document.export_to_markdown()
        print(markdown_output[:500])  # Print the first 500 characters
        print("\n... (truncated output)")
    else:
        print(f"Document conversion failed or was partial: {result.status}")
        # You can inspect result.input for details about the source
except Exception as e:
    print(f"An error occurred during conversion: {e}")
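As a follow-up to the quickstart, the sketch below writes the exported Markdown to disk instead of printing it. `save_markdown` is a hypothetical helper, not part of Docling. It assumes only that the object passed in has the `status`, `document`, and `input` attributes described above, and that `result.input.file` holds the source file path.

```python
from pathlib import Path


def save_markdown(result, out_dir="."):
    """Write result.document as Markdown; return the path, or None on failure.

    `result` is assumed to look like the object returned by convert().
    """
    # The status may be an enum member or a plain string; normalize it
    # to lowercase text before comparing.
    status = str(getattr(result.status, "value", result.status)).lower()
    if status != "success":
        return None

    # Name the output after the input file, e.g. paper.pdf -> paper.md.
    out_path = Path(out_dir) / f"{Path(result.input.file).stem}.md"
    out_path.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return out_path
```

With a real conversion this would be called as `save_markdown(converter.convert(source_url), "out")`, provided the `out` directory already exists. Returning `None` for any non-success status keeps partial conversions from being written out as if they were complete.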