Docling Python SDK

2.85.0 · active · verified Thu Apr 09

Docling is a Python SDK and CLI that parses documents in formats such as PDF, DOCX, and HTML into a unified, structured representation. It simplifies downstream workflows for generative AI applications by recognizing page layouts, tables, and formulas, and by supporting OCR. The library is actively maintained and frequently updated; the current version is 2.85.0.

Quickstart

This quickstart converts a document (here, one fetched from a URL) into a structured `DoclingDocument` and exports its content to Markdown. Docling detects the document type automatically and returns a `result` object that exposes the converted document (`result.document`), the conversion status (`result.status`), and details about the input (`result.input`).

import os

from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

# Example: convert a document from a URL and export it to Markdown.
# convert() accepts a local file path or a URL; in-memory binary data
# must be wrapped in a DocumentStream rather than passed as a raw file object.
source_url = os.environ.get('DOCLING_EXAMPLE_URL', 'https://arxiv.org/pdf/2408.09869')

# Initialize the DocumentConverter
converter = DocumentConverter()

try:
    # Convert the document
    result = converter.convert(source_url)

    # result.status is a ConversionStatus enum whose values are lowercase
    # strings, so compare against the enum member, not the string 'SUCCESS'.
    if result.status == ConversionStatus.SUCCESS:
        # Access the structured document and export it to Markdown
        markdown_output = result.document.export_to_markdown()
        print(markdown_output[:500])  # Print the first 500 characters
        print("\n... (truncated output)")
    else:
        print(f"Document conversion failed or was partial: {result.status}")
        # You can inspect result.input for details about the source
except Exception as e:
    print(f"An error occurred during conversion: {e}")
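
As a follow-up to the quickstart, the sketch below shows one way the three parts of the `result` object might be used together: it checks the status, derives a file name from the input details, and writes the Markdown export to disk. The helper `save_markdown` is hypothetical and not part of Docling. It assumes that `result.status` is a string-valued enum whose success value is "success" and that `result.input.file` is a path-like object. So that the sketch runs without Docling installed, it is exercised here with a stand-in object instead of a real conversion result.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_markdown(result, out_dir):
    """Write result.document as Markdown into out_dir.

    Hypothetical helper (not part of Docling). Returns the written path,
    or None when the conversion did not fully succeed.
    """
    # Assumption: status is a string-valued enum; "success" marks full success.
    status = getattr(result.status, "value", result.status)
    if status != "success":
        return None

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Assumption: result.input.file is path-like and names the source document.
    stem = Path(str(result.input.file)).stem or "document"
    target = out_dir / f"{stem}.md"
    target.write_text(result.document.export_to_markdown(), encoding="utf-8")
    return target


# Stand-in for a conversion result, used only to exercise the helper.
fake_result = SimpleNamespace(
    status="success",
    input=SimpleNamespace(file=Path("2408.09869.pdf")),
    document=SimpleNamespace(export_to_markdown=lambda: "# Title\n"),
)

with tempfile.TemporaryDirectory() as tmp:
    written = save_markdown(fake_result, tmp)
    saved_name = written.name
    saved_text = written.read_text(encoding="utf-8")
```

With a real conversion, the same helper would be called as `save_markdown(converter.convert(source_url), "output")`. Returning `None` for any non-success status leaves the decision about partial results to the caller.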
