Unstructured

0.22.18 · active · verified Thu Apr 09

Unstructured is an open-source Python library for ingesting and preprocessing unstructured data formats such as PDFs, HTML, Word documents, and images. It provides modular functions for partitioning, cleaning, and staging data, aimed mainly at preparing input for Large Language Model (LLM) workflows. The library is actively maintained with frequent releases; the current version is 0.22.18.

Quickstart

This quickstart uses the `partition` function to process a PDF file and extract its constituent elements, then serializes those elements to JSON with `elements_to_json`. PDF support requires the `unstructured[pdf]` extra, plus system dependencies such as Poppler and Tesseract for full functionality.
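
Since the quickstart relies on system dependencies, it can help to check for them before running it. The following is a minimal sketch using only the standard library. It assumes Poppler can be detected through its `pdftoppm` command-line tool and Tesseract through the `tesseract` executable; `REQUIRED_TOOLS` and `find_missing_tools` are hypothetical helper names, not part of Unstructured.

```python
import shutil

# Assumption: these executables are what the system packages place on PATH.
# pdftoppm ships with Poppler; tesseract is the Tesseract OCR command-line tool.
REQUIRED_TOOLS = {"Poppler": "pdftoppm", "Tesseract": "tesseract"}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]


missing = find_missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("Poppler and Tesseract executables were found on PATH.")
```

An empty result only shows that the executables are discoverable; it does not verify their versions or that the Python extra is installed.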

import os
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json

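# Create a dummy PDF file for demonstration.
# Note: this hand-written file is not strictly valid (its xref offsets do not
# match the object positions, and the page resources define no /F1 font), so
# the result depends on how tolerant the PDF parser is. Substitute a real PDF
# for meaningful output.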
dummy_pdf_content = b"%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R/Resources<<>>>>endobj 4 0 obj<</Length 44>>stream\nBT\n/F1 24 Tf\n100 700 Td\n(Hello, Unstructured!) Tj\nET\nendstream\nendobj\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000074 00000 n\n0000000130 00000 n\n0000000302 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref\n390\n%%EOF"

with open("dummy.pdf", "wb") as f:
    f.write(dummy_pdf_content)

# Partition the PDF document
print("Partitioning 'dummy.pdf'...")
elements = partition(filename="dummy.pdf")

# Print the extracted elements
print("\n--- Extracted Elements ---")
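for element in elements:
    # Show at most 75 characters, adding an ellipsis only when text is cut off
    preview = element.text if len(element.text) <= 75 else element.text[:75] + "..."
    print(f"Type: {type(element).__name__}, Text: {preview}")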

# Convert elements to JSON and print
print("\n--- JSON Output ---")
json_output = elements_to_json(elements)
print(json_output)

# Clean up the dummy file
os.remove("dummy.pdf")
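
The JSON string produced above can be post-processed with the standard library alone. The sketch below assumes each serialized element is an object with `type`, `element_id`, `text`, and `metadata` keys; the sample records and their values are invented for illustration and are not real output from the quickstart.

```python
import json
from collections import Counter

# Illustrative sample only: in practice this string would come from
# elements_to_json(elements), and real records carry more metadata fields.
sample_json = """
[
  {"type": "Title", "element_id": "a1", "text": "Hello, Unstructured!",
   "metadata": {"filename": "dummy.pdf", "page_number": 1}},
  {"type": "NarrativeText", "element_id": "b2", "text": "Some body text.",
   "metadata": {"filename": "dummy.pdf", "page_number": 1}}
]
"""

records = json.loads(sample_json)

# Count how many elements of each type were extracted.
type_counts = Counter(record["type"] for record in records)

# Group element text by page number (None when the metadata lacks one).
texts_by_page = {}
for record in records:
    page = record.get("metadata", {}).get("page_number")
    texts_by_page.setdefault(page, []).append(record["text"])

print(dict(type_counts))
print(texts_by_page)
```

Grouping by type or page in this way is one simple route to filtering elements before passing text to a downstream LLM pipeline.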
