Docling Parse

5.8.0 · active · verified Sat Apr 11

Docling Parse is a Python package that extracts text, paths, and bitmap images, together with their coordinates, from programmatic PDFs. It is a core component of the broader Docling PDF conversion ecosystem. The library is actively maintained, with frequent minor and patch releases.

Quickstart

This quickstart initializes a `DoclingPdfParser`, loads a PDF document, and iterates over its pages to extract word-level text together with each word's bounding rectangle. To run it, replace `"path/to/your/document.pdf"` with the path to a real PDF file; if the file does not exist, the script prints a warning and skips parsing. A commented-out snippet at the end shows how a page could be rendered as an image.

import os
from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

# Replace this placeholder with the path to a real PDF file.
pdf_file_path = "path/to/your/document.pdf"

# Only parse when the file exists; otherwise report the problem and skip parsing.
if not os.path.exists(pdf_file_path):
    print(f"Warning: PDF file not found at '{pdf_file_path}'. This example requires a valid PDF.")
    print("Please replace 'path/to/your/document.pdf' with an actual path to a PDF.")
else:
    parser = DoclingPdfParser()
    # Load the PDF document
    pdf_doc: PdfDocument = parser.load(path_or_stream=pdf_file_path)

    # Iterate over pages and extract words
    print(f"Processing PDF: {pdf_file_path}")
    for page_no, pred_page in pdf_doc.iterate_pages():
        # iterate_pages() is expected to yield 1-based page numbers, so print page_no as-is
        print(f"\n--- Page {page_no} ---")
        # Iterate over the word-cells on the page
        for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
            print(f"Rect: {word.rect}, Text: '{word.text}'")
        
        # Optionally, render the page as an image (requires Pillow)
        # img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
        # img.show() # This would open the image if Pillow is installed
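
Once word-level text and rectangles are available, a common next step is to group words into lines of text. The following is a minimal sketch of that idea, not part of Docling Parse: the `Word` tuple, the `group_words_into_lines` helper, and the `y_tolerance` parameter are hypothetical names introduced here, and the sample coordinates assume a top-left origin with y growing downward. The rectangles returned by the parser may use a different origin and a different structure, so adapt the conversion accordingly.

```python
from typing import List, Tuple

# Hypothetical stand-in for a word cell: (text, left, top, right, bottom).
# Assumes a top-left origin (y grows downward). This is NOT a docling type.
Word = Tuple[str, float, float, float, float]


def group_words_into_lines(words: List[Word], y_tolerance: float = 2.0) -> List[str]:
    """Group words whose top edges lie within y_tolerance of a line's first word."""
    lines: List[List[Word]] = []
    # Sort top-to-bottom (then left-to-right) so each word is only compared
    # against the most recently started line.
    for word in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(lines[-1][0][2] - word[2]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left-to-right and join them with spaces.
    return [" ".join(w[0] for w in sorted(line, key=lambda w: w[1])) for line in lines]


# Made-up sample data for illustration only.
sample = [
    ("world", 60.0, 10.5, 95.0, 22.0),
    ("Hello", 10.0, 10.0, 55.0, 22.0),
    ("Second", 10.0, 30.0, 58.0, 42.0),
    ("line", 62.0, 30.2, 85.0, 42.0),
]
print(group_words_into_lines(sample))  # ['Hello world', 'Second line']
```

The tolerance absorbs small baseline differences between words on the same line; a real document may need a tolerance derived from the font size rather than a fixed value.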
