Unstructured
Unstructured is an open-source Python library that simplifies the ingestion and preprocessing of unstructured data formats, including PDFs, HTML, Word documents, and images. It provides modular functions for partitioning, cleaning, and staging data, aimed primarily at preparing input for Large Language Model (LLM) workflows. The library is actively maintained with frequent releases, currently at version 0.22.18.
Warnings
- breaking Version 0.21.0 replaced NLTK with spaCy to remediate CVE-2025-14009. If your project relied on NLTK components used by Unstructured, you might need to update your dependencies or code.
- gotcha Full functionality for various document types (e.g., PDFs, images, Office docs) requires installing additional system dependencies (e.g., `libmagic-dev`, `poppler-utils`, `tesseract-ocr`, `libreoffice`, `pandoc`). Without these, some document processing will fail or be limited.
- gotcha The open-source library is primarily for prototyping. For production-grade scenarios, Unstructured-IO recommends its hosted UI or API, which offers greater scalability, robustness, and more advanced features.
- gotcha Element IDs are SHA-256 hashes by default and are not guaranteed to be unique: different elements with identical text content can receive the same ID. Using them directly as primary keys can therefore cause collisions.
- breaking The telemetry (analytics) opt-out environment variable semantics changed. `DO_NOT_TRACK` and `SCARF_NO_ANALYTICS` now treat any non-empty string value (e.g., 'false', '0', 'no') as an opt-out. Previously, only the exact string 'true' worked.
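Two of the warnings above lend themselves to a quick preflight check before processing documents. The sketch below uses only the Python standard library and does not call Unstructured itself. The helper names (`missing_system_tools`, `telemetry_opted_out`) are hypothetical, the executable names are assumptions about what each system package typically installs, and the opt-out check simply restates the rule described in the warning rather than reproducing the library's internal logic.

```python
import os
import shutil

# Assumed executable names for each system package; adjust for your platform.
# libmagic is a shared library rather than an executable, so it is not
# detectable this way and is left out.
TOOLS = {
    "poppler-utils": "pdftoppm",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}


def missing_system_tools(which=shutil.which):
    """Return the package names whose executable is not found on PATH."""
    return sorted(pkg for pkg, exe in TOOLS.items() if which(exe) is None)


def telemetry_opted_out(environ=None):
    """Apply the documented rule: any non-empty value counts as an opt-out."""
    env = os.environ if environ is None else environ
    names = ("DO_NOT_TRACK", "SCARF_NO_ANALYTICS")
    return any(env.get(name, "") != "" for name in names)


print("Possibly missing system packages:", missing_system_tools())
print("Telemetry opted out:", telemetry_opted_out())
```

Note that under this rule `DO_NOT_TRACK=false` still opts out, which is the behavior change the warning describes.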
Install
- pip install unstructured
- pip install "unstructured[all-docs]"
- pip install "unstructured[pdf,docx]"
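After installing, it can help to confirm which version is present and whether the modules behind a given extra are importable. The following is a minimal sketch using the standard library only. The helper names are hypothetical, and the mapping of extras to module names (`pdfminer` for the pdf extra, `docx` for the docx extra) is an assumption that may differ between releases.

```python
from importlib import metadata, util


def installed_version(package="unstructured"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def missing_modules(modules):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if util.find_spec(name) is None]


print("unstructured version:", installed_version())
# Assumed module names for the pdf and docx extras.
print("missing modules:", missing_modules(["pdfminer", "docx"]))
```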
Imports
- partition
from unstructured.partition.auto import partition
- partition_pdf
from unstructured.partition.pdf import partition_pdf
- elements_to_json
from unstructured.staging.base import elements_to_json
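As a rough illustration of how these imports fit together, the sketch below partitions a PDF with `partition_pdf`, tallies the resulting element types, and serializes the elements with `elements_to_json`. The function names `count_by_type` and `partition_pdf_to_json` are hypothetical, and the choice of `strategy="fast"` is an assumption made to avoid the heavier model-based path; the Unstructured imports are deferred so the helper module loads even where the pdf extra is not installed.

```python
from collections import Counter


def count_by_type(elements):
    """Count elements by class name (for example Title or NarrativeText)."""
    return Counter(type(el).__name__ for el in elements)


def partition_pdf_to_json(path):
    """Partition a PDF and return (type counts, JSON string).

    Requires the pdf extra; imports are deferred until the call.
    """
    from unstructured.partition.pdf import partition_pdf
    from unstructured.staging.base import elements_to_json

    elements = partition_pdf(filename=path, strategy="fast")
    # With no filename argument, elements_to_json returns the JSON as a string.
    return count_by_type(elements), elements_to_json(elements)
```

`partition` from `unstructured.partition.auto` can be substituted when the file type is not known in advance, since it detects the type and dispatches to the matching partitioner.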
Quickstart
import os
from unstructured.partition.auto import partition
from unstructured.staging.base import elements_to_json
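# Create a minimal hand-written PDF for demonstration.
# Partitioning PDFs needs the pdf extra: pip install "unstructured[pdf]"
# If this stub yields no elements in your environment, substitute a real PDF.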
dummy_pdf_content = b"%PDF-1.4\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R/Resources<<>>>>endobj 4 0 obj<</Length 44>>stream\nBT\n/F1 24 Tf\n100 700 Td\n(Hello, Unstructured!) Tj\nET\nendstream\nendobj\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000074 00000 n\n0000000130 00000 n\n0000000302 00000 n\ntrailer<</Size 5/Root 1 0 R>>startxref\n390\n%%EOF"
with open("dummy.pdf", "wb") as f:
    f.write(dummy_pdf_content)
# Partition the PDF document
print("Partitioning 'dummy.pdf'...")
elements = partition(filename="dummy.pdf")
# Print the extracted elements
print("\n--- Extracted Elements ---")
for element in elements:
    print(f"Type: {type(element).__name__}, Text: {element.text[:75]}...")
# Convert elements to JSON and print
print("\n--- JSON Output ---")
json_output = elements_to_json(elements)
print(json_output)
# Clean up the dummy file
os.remove("dummy.pdf")
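
When partitioned elements are stored in a database, the warning about non-unique element IDs becomes relevant. The sketch below shows one possible workaround: combining each ID with its source name and position to form a composite key. It is an illustration only; `text_hash` is a stand-in that hashes text with SHA-256 and is not Unstructured's actual ID routine, and `composite_keys` is a hypothetical helper.

```python
import hashlib


def text_hash(text):
    """Stand-in for a text-derived ID: SHA-256 hex digest of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def composite_keys(source, ids):
    """Build keys of the form '<source>:<position>:<id>'."""
    return [f"{source}:{i}:{eid}" for i, eid in enumerate(ids)]


# Two elements with identical text (such as a repeated page header)
# produce the same hash, so the bare IDs collide.
ids = [text_hash(t) for t in ["Page 1", "Introduction", "Page 1"]]
keys = composite_keys("dummy.pdf", ids)

print("unique bare ids:", len(set(ids)))
print("unique composite keys:", len(set(keys)))
```

With real output from `partition`, the ID list would come from the elements themselves, for example `[el.id for el in elements]`.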