Unstructured Inference

1.6.6 · active · verified Sun Apr 12

unstructured-inference provides the core inference code for the layout-parsing models used in the Unstructured.IO ecosystem. It extracts structured content from unstructured documents such as PDFs and images, and supports detection models including Detectron2 and YOLOX. The library is actively maintained with frequent releases; the current version is 1.6.6.

Quickstart

This quickstart loads a PDF and extracts its layout elements with the default inference model. It writes a dummy PDF to a temporary file so the example can run as-is; in a real application, replace `temp_pdf_path` with the path to your own PDF. The output lists each detected element's type and the first 50 characters of its text.

import os
import tempfile

# Create a dummy PDF file for demonstration
# In a real scenario, you would provide the path to your actual PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp_pdf:
    temp_pdf_path = temp_pdf.name
    temp_pdf.write(b"%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 41>>stream\nBT /F1 24 Tf 100 700 Td (Hello Unstructured!) Tj ET\nendstream\nendobj\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000055 00000 n\n0000000108 00000 n\n0000000201 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref\n294\n%%EOF")

from unstructured_inference.inference.layout import DocumentLayout

try:
    # Perform layout parsing on the document
    # For real use, replace temp_pdf_path with your PDF file path.
    layout = DocumentLayout.from_file(temp_pdf_path)

    print(f"Found {len(layout.pages)} page(s) in the document.")
    for i, page in enumerate(layout.pages):
        print(f"--- Page {i+1} ---")
        for element in page.elements:
            # element.text can be None (e.g. for regions without extracted text)
            text = element.text or ""
            print(f"Element Type: {element.type}, Text: {text[:50]}...")
            # Each element also carries a bounding box.
            # print(f"  Bounding Box: {element.bbox}")
finally:
    # Clean up the dummy PDF file
    os.remove(temp_pdf_path)
