Docling Parse
Docling Parse is a Python package that extracts text, paths, and bitmap images, together with their precise coordinates, from programmatic PDFs. It is a core component of the broader Docling PDF conversion ecosystem. The library is actively maintained, with frequent minor and patch releases.
Warnings
- breaking With the introduction of `docling-parse` v5, previous parsing backends (especially those integrated directly into the `docling` parent project) were deprecated. Users migrating from older `docling` versions (pre-2.73.1) relying on internal parser implementations may need to update their code to use the `docling-parse` v5 API explicitly.
- gotcha The `docling-parse` library requires Python 3.10 or higher. Installations on older Python versions will fail or result in unexpected behavior.
- gotcha Parsing large PDF documents 'in one go' using `parser.parse_pdf_from_key()` (from older API) or similar memory-intensive methods can consume significant memory. The recommended approach for memory optimization is to process PDFs page by page.
- gotcha Malformed or broken PDF documents can lead to parsing errors or infinite loops. Recent fixes (v5.3.4, v5.6.2) addressed such issues, including 'Robustify parse of broken pdfs' and 'Prevent infinite loop in TOC extraction with circular PDF references'.
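The page-by-page recommendation and the broken-PDF warning above can be sketched together. This is a minimal illustration, not the library's own implementation: `stream_page_text` and `parse_tolerantly` are hypothetical helper names, and the code relies only on the `iterate_pages()` and `iterate_cells(unit_type=...)` calls shown in the Quickstart. The document object and the loader are passed in, so the sketch itself does not import `docling_parse`.

```python
from typing import Any, Callable, Iterator, List, Optional, Tuple


def stream_page_text(pdf_doc: Any, unit_type: Any) -> Iterator[Tuple[int, str]]:
    """Yield (page_no, text) one page at a time (hypothetical helper)."""
    for page_no, page in pdf_doc.iterate_pages():
        # Only the current page's cells are collected before yielding.
        words = [cell.text for cell in page.iterate_cells(unit_type=unit_type)]
        yield page_no, " ".join(words)


def parse_tolerantly(
    load: Callable[[str], Any], path: str, unit_type: Any
) -> Tuple[List[Tuple[int, str]], Optional[str]]:
    """Return (pages parsed so far, error message or None) (hypothetical helper)."""
    pages: List[Tuple[int, str]] = []
    try:
        for item in stream_page_text(load(path), unit_type):
            pages.append(item)
    except Exception as exc:
        # Broad on purpose: the exception types raised for broken PDFs
        # are an unknown here, so every failure is reported the same way.
        return pages, f"{type(exc).__name__}: {exc}"
    return pages, None
```

With docling-parse installed, `load` could be `lambda p: DoclingPdfParser().load(path_or_stream=p)` and `unit_type` could be `TextCellUnit.WORD`. A try/except of this kind reports errors but cannot interrupt a parser that hangs; guarding against that would need a separate process with a timeout.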
Install
pip install docling-parse
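Because the package requires Python 3.10 or higher, it can help to check the interpreter before installing. The following is a minimal sketch using only the standard library; `MIN_PYTHON` and `python_is_supported` are illustrative names, not part of docling-parse.

```python
import sys
from typing import Sequence

# Minimum version stated in the Warnings section of this document.
MIN_PYTHON = (3, 10)


def python_is_supported(version_info: Sequence[int] = sys.version_info) -> bool:
    """Return True when the (major, minor) version meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON
```

Calling `python_is_supported()` with no argument checks the running interpreter; passing a tuple such as `(3, 9, 18)` checks a specific version.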
Imports
- DoclingPdfParser
from docling_parse.pdf_parser import DoclingPdfParser
- PdfDocument
from docling_parse.pdf_parser import PdfDocument
- TextCellUnit
from docling_core.types.doc.page import TextCellUnit
- pdf_parser_v2 (older name; import DoclingPdfParser instead)
from docling_parse.pdf_parser import DoclingPdfParser
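The imports above come from two top-level packages, `docling_parse` and `docling_core`. A small standard-library check, sketched below, can confirm that both are importable before the Quickstart is run. `missing_packages` and `REQUIRED` are hypothetical names introduced for illustration.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages behind the imports listed in this section.
REQUIRED = ("docling_parse", "docling_core")
```

`missing_packages(REQUIRED)` returns an empty list when both packages are installed, and otherwise lists the ones that still need to be installed.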
Quickstart
import os

from docling_core.types.doc.page import TextCellUnit
from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument

# Replace this placeholder with the path to a real PDF file.
pdf_file_path = "path/to/your/document.pdf"

if not os.path.exists(pdf_file_path):
    # Nothing to parse: report the problem and skip the parsing step.
    print(f"Warning: PDF file not found at '{pdf_file_path}'. This example requires a valid PDF.")
    print("Please replace 'path/to/your/document.pdf' with an actual path to a PDF.")
else:
    parser = DoclingPdfParser()

    # Load the PDF document
    pdf_doc: PdfDocument = parser.load(path_or_stream=pdf_file_path)

    # Iterate over pages and extract words
    print(f"Processing PDF: {pdf_file_path}")
    for page_no, pred_page in pdf_doc.iterate_pages():
        # page_no is printed as yielded; iterate_pages() is understood to count pages from 1
        print(f"\n--- Page {page_no} ---")

        # Iterate over the word-cells on the page
        for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):
            print(f"Rect: {word.rect}, Text: '{word.text}'")

        # Optionally, render the page as an image (requires Pillow)
        # img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)
        # img.show()  # Opens the image in the default viewer