{"id":2935,"library":"docling-parse","title":"Docling Parse","description":"Docling Parse is a Python package that extracts text, paths, and bitmap images, together with their coordinates, from programmatic PDFs. It is a core component of the broader Docling PDF conversion ecosystem. The library is actively maintained, with frequent minor and patch releases.","status":"active","version":"5.8.0","language":"en","source_language":"en","source_url":"https://github.com/docling-project/docling-parse","tags":["pdf","parsing","text extraction","coordinates","document analysis","image extraction","pdf conversion"],"install":[{"cmd":"pip install docling-parse","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Provides core data types and structures for Docling.","package":"docling-core","optional":false},{"reason":"Required for image rendering capabilities.","package":"pillow","optional":false},{"reason":"Used for data validation and settings management.","package":"pydantic","optional":false},{"reason":"Used for table formatting in some outputs.","package":"tabulate","optional":false}],"imports":[{"note":"Primary class for PDF parsing.","symbol":"DoclingPdfParser","correct":"from docling_parse.pdf_parser import DoclingPdfParser"},{"note":"Represents the parsed PDF document structure.","symbol":"PdfDocument","correct":"from docling_parse.pdf_parser import PdfDocument"},{"note":"Enum for specifying text unit granularity (char, word, line).","symbol":"TextCellUnit","correct":"from docling_core.types.doc.page import TextCellUnit"},{"note":"Older versions of `docling-parse` or its parent `docling` might have used `pdf_parser_v2`. 
The current approach is to import `DoclingPdfParser` from `docling_parse.pdf_parser` directly.","wrong":"from docling_parse.docling_parse import pdf_parser_v2","symbol":"pdf_parser_v2","correct":"from docling_parse.pdf_parser import DoclingPdfParser"}],"quickstart":{"code":"import os\nfrom docling_core.types.doc.page import TextCellUnit\nfrom docling_parse.pdf_parser import DoclingPdfParser, PdfDocument\n\n# Replace this placeholder with the path to an actual PDF file\npdf_file_path = \"path/to/your/document.pdf\"\n\nif not os.path.exists(pdf_file_path):\n    # Nothing to parse without a real PDF, so report and skip\n    print(f\"Warning: PDF file not found at '{pdf_file_path}'. This example requires a valid PDF.\")\n    print(\"Please replace 'path/to/your/document.pdf' with an actual path to a PDF.\")\nelse:\n    parser = DoclingPdfParser()\n    # Load the PDF document\n    pdf_doc: PdfDocument = parser.load(path_or_stream=pdf_file_path)\n\n    # Iterate over pages and extract words\n    print(f\"Processing PDF: {pdf_file_path}\")\n    for page_no, pred_page in pdf_doc.iterate_pages():\n        # iterate_pages() yields 1-based page numbers\n        print(f\"\\n--- Page {page_no} ---\")\n        # Iterate over the word-cells on the page\n        for word in pred_page.iterate_cells(unit_type=TextCellUnit.WORD):\n            print(f\"Rect: {word.rect}, Text: '{word.text}'\")\n\n        # Optionally, render the page as an image (requires Pillow)\n        # img = pred_page.render_as_image(cell_unit=TextCellUnit.CHAR)\n        # img.show() # This would open the 
image if Pillow is installed","lang":"python","description":"This quickstart demonstrates how to initialize the `DoclingPdfParser`, load a PDF document, and iterate through its pages to extract text at the word level, including bounding box coordinates. It also shows the import paths for necessary components. For a runnable example, ensure you replace `\"path/to/your/document.pdf\"` with a valid PDF file path. The example also briefly mentions rendering pages as images."},"warnings":[{"fix":"Ensure you are importing `DoclingPdfParser` and `PdfDocument` from `docling_parse.pdf_parser` and `TextCellUnit` from `docling_core.types.doc.page`. Review official documentation for the latest API usage if coming from a significantly older setup.","message":"With the introduction of `docling-parse` v5, previous parsing backends (especially those integrated directly into the `docling` parent project) were deprecated. Users migrating from older `docling` versions (pre-2.73.1) relying on internal parser implementations may need to update their code to use the `docling-parse` v5 API explicitly.","severity":"breaking","affected_versions":"<5.0.0 (indirectly via docling)"},{"fix":"Upgrade your Python environment to version 3.10 or newer.","message":"The `docling-parse` library requires Python 3.10 or higher. Installations on older Python versions will fail or result in unexpected behavior.","severity":"gotcha","affected_versions":"<5.0.0"},{"fix":"Utilize `pdf_doc.iterate_pages()` and process each page individually to minimize memory footprint, especially for large documents. If using `DoclingThreadedPdfParser`, configure `max_concurrent_results` appropriately.","message":"Parsing large PDF documents 'in one go' using `parser.parse_pdf_from_key()` (from older API) or similar memory-intensive methods can consume significant memory. 
The recommended approach for memory optimization is to process PDFs page by page.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure PDFs are well-formed where possible. Keep `docling-parse` updated to the latest version to benefit from robustness improvements. Implement robust error handling around PDF loading and parsing operations.","message":"Malformed or broken PDF documents can lead to parsing errors or infinite loops. Recent fixes (v5.3.4, v5.6.2) addressed issues like 'Robustify parse of broken pdfs' and 'Prevent infinite loop in TOC extraction with circular PDF references'.","severity":"gotcha","affected_versions":"<5.6.2"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}