Unstructured Inference
`unstructured-inference` provides the core model inference code for the layout parsing models used in the Unstructured.IO ecosystem. It extracts structured content from unstructured documents such as PDFs and images, and supports detection models including Detectron2 and YOLOX. The library is actively maintained and released frequently; the current version is 1.6.6.
Warnings
- gotcha Detectron2 is a crucial dependency for using many layout parsing models within unstructured-inference, particularly those from the layoutparser model zoo. It is NOT automatically installed with `pip install unstructured-inference` and its installation can be complex, especially on Windows, where it's not officially supported. Users on macOS/Linux may need to build it from source.
- breaking The library has a strict Python version requirement, currently supporting Python 3.12 only. Using other Python versions will lead to installation or runtime errors.
- gotcha When `unstructured-inference` is used together with the main `unstructured` library, keep the two packages at compatible versions. `unstructured-inference` provides the underlying model capabilities for `unstructured`'s partitioning bricks, so mismatched versions can cause unexpected behavior or errors.
- gotcha Users have reported that table extraction in recent versions of `unstructured-inference` may be less effective than in older versions.
- deprecated When using `unstructured`'s `partition` function with `strategy='hi_res'` (which utilizes `unstructured-inference` models), the `model_name` parameter is deprecated. Users should now use `hi_res_model_name` instead.
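The parameter rename in the last warning can be handled at call sites with a small wrapper. The following is a minimal sketch, not part of either library: `hi_res_partition_kwargs` is a hypothetical helper that builds the keyword arguments for a `hi_res` call and maps a legacy `model_name` value onto `hi_res_model_name`. The model name `"yolox"` and the file name `report.pdf` are illustrative assumptions.

```python
import warnings
from typing import Any, Dict, Optional


def hi_res_partition_kwargs(
    filename: str,
    hi_res_model_name: Optional[str] = None,
    model_name: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a hi_res partition call (hypothetical helper).

    `model_name` is accepted only to migrate older call sites; it is mapped
    onto `hi_res_model_name` and a DeprecationWarning is emitted.
    """
    if model_name is not None:
        warnings.warn(
            "model_name is deprecated; use hi_res_model_name instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # An explicit hi_res_model_name wins over the legacy value.
        if hi_res_model_name is None:
            hi_res_model_name = model_name

    kwargs: Dict[str, Any] = {"filename": filename, "strategy": "hi_res"}
    if hi_res_model_name is not None:
        kwargs["hi_res_model_name"] = hi_res_model_name
    return kwargs


# Usage (requires the separate `unstructured` package to be installed):
# from unstructured.partition.auto import partition
# elements = partition(**hi_res_partition_kwargs("report.pdf", hi_res_model_name="yolox"))
```

Keeping the mapping in one place means older code that still passes `model_name` continues to work while the deprecated keyword never reaches `partition`.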
Install
- pip install unstructured-inference
- pip install 'git+https://github.com/facebookresearch/detectron2.git@57bdb21249d5418c130d54e2ebdc94dda7a4c01a'
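After installing, it can help to confirm what actually ended up in the environment, since Detectron2 is installed separately and `unstructured` should be kept in step with `unstructured-inference`. The sketch below uses only the standard library; `installed_version` and `preflight` are hypothetical helper names, and the check reports what it finds without judging whether the versions are compatible.

```python
import sys
from importlib import metadata, util
from typing import Dict, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def preflight() -> Dict[str, object]:
    """Collect the facts the warnings above care about."""
    return {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "unstructured-inference": installed_version("unstructured-inference"),
        "unstructured": installed_version("unstructured"),
        # find_spec returns None when the top-level module cannot be located.
        "detectron2_importable": util.find_spec("detectron2") is not None,
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

A value of `None` for a package means it is not installed in the active environment; `detectron2_importable: False` means Detectron2-based models will not be available.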
Imports
- DocumentLayout
from unstructured_inference.inference.layout import DocumentLayout
- get_model
from unstructured_inference.models.base import get_model
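The two imports above can be combined to parse a document with an explicitly chosen detection model. This is a sketch under stated assumptions rather than a documented recipe: `parse_with_model` is a hypothetical wrapper, and it assumes that `get_model` accepts a model name (or `None` for the library default) and that `DocumentLayout.from_file` accepts a `detection_model` keyword. Check both against the installed version before relying on them.

```python
from typing import Optional


def parse_with_model(pdf_path: str, model_name: Optional[str] = None):
    """Parse `pdf_path` with an explicitly chosen detection model (hypothetical wrapper)."""
    # Imported lazily so this module still loads when the library is absent.
    from unstructured_inference.inference.layout import DocumentLayout
    from unstructured_inference.models.base import get_model

    # Assumption: passing None selects the library's default model.
    model = get_model(model_name)
    # Assumption: from_file accepts a `detection_model` keyword argument.
    return DocumentLayout.from_file(pdf_path, detection_model=model)
```

Loading a model may download weights on first use, so the first call can be noticeably slower than later ones.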
Quickstart
import os
import tempfile

from unstructured_inference.inference.layout import DocumentLayout

# Create a dummy PDF file for demonstration.
# In a real scenario, you would provide the path to your actual PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as temp_pdf:
    temp_pdf_path = temp_pdf.name
    temp_pdf.write(b"%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R>>endobj 4 0 obj<</Length 41>>stream\nBT /F1 24 Tf 100 700 Td (Hello Unstructured!) Tj ET\nendstream\nendobj\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000055 00000 n\n0000000108 00000 n\n0000000201 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref\n294\n%%EOF")

try:
    # Perform layout parsing on the document.
    # For real use, replace temp_pdf_path with your PDF file path.
    layout = DocumentLayout.from_file(temp_pdf_path)
    print(f"Found {len(layout.pages)} page(s) in the document.")
    for i, page in enumerate(layout.pages):
        print(f"--- Page {i+1} ---")
        for element in page.elements:
            # element.text may be None when no text was extracted for a region.
            text = element.text or ""
            print(f"Element Type: {element.type}, Text: {text[:50]}...")
            # Other attributes, such as the bounding box, are also available.
            # print(f"  Bounding Box: {element.bbox}")
finally:
    # Clean up the dummy PDF file.
    os.remove(temp_pdf_path)
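Once elements are available, a common next step is to summarize them by type. The sketch below shows one way to do that. It runs without the library by using `StubElement`, a hypothetical stand-in that only mimics the `type` and `text` attributes used in the Quickstart; `count_by_type` and `preview` are likewise illustrative helpers, not library functions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class StubElement:
    """Hypothetical stand-in for a layout element (not a library class)."""

    type: Optional[str]
    text: Optional[str]


def count_by_type(elements: Iterable) -> Counter:
    """Count elements per type, grouping missing types under 'Unknown'."""
    return Counter(getattr(el, "type", None) or "Unknown" for el in elements)


def preview(text: Optional[str], limit: int = 50) -> str:
    """Return at most `limit` characters, adding '...' only when truncated."""
    text = text or ""
    return text if len(text) <= limit else text[:limit] + "..."


elements = [
    StubElement("Title", "Hello Unstructured!"),
    StubElement("NarrativeText", None),
    StubElement("Title", "x" * 60),
]
print(count_by_type(elements))
print([preview(el.text) for el in elements])
```

With real output, the same helpers could be applied to `page.elements` for each page in `layout.pages`, assuming the elements expose `type` and `text` as in the Quickstart.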