{"id":2347,"library":"unstructured","title":"Unstructured","description":"Unstructured is an open-source Python library designed to simplify the ingestion and preprocessing of diverse unstructured data formats, including PDFs, HTML, Word documents, and images. It provides modular functions for partitioning, cleaning, and staging data, primarily optimizing data workflows for Large Language Models (LLMs). The library is actively maintained with frequent releases, currently at version 0.22.18.","status":"active","version":"0.22.18","language":"en","source_language":"en","source_url":"https://github.com/Unstructured-IO/unstructured","tags":["NLP","document processing","information extraction","OCR","LLM tooling","data preprocessing"],"install":[{"cmd":"pip install unstructured","lang":"bash","label":"Basic installation (for plain text, HTML, XML, JSON, Emails)"},{"cmd":"pip install \"unstructured[all-docs]\"","lang":"bash","label":"Full installation (supports all document types with Python dependencies)"},{"cmd":"pip install \"unstructured[pdf,docx]\"","lang":"bash","label":"Install specific document type dependencies (e.g., PDF and DOCX)"}],"dependencies":[{"reason":"System dependency for filetype detection, highly recommended for `partition` function.","package":"libmagic-dev","optional":true},{"reason":"System dependency for PDF processing (often required with `unstructured[pdf]`).","package":"poppler-utils","optional":true},{"reason":"System dependency for image and PDF OCR (often required with `unstructured[image]` or `unstructured[pdf]`).","package":"tesseract-ocr","optional":true},{"reason":"System dependency for Microsoft Office document processing (e.g., DOCX, PPTX).","package":"libreoffice","optional":true},{"reason":"System dependency for EPUB, ODT, and RTF file processing.","package":"pandoc","optional":true}],"imports":[{"note":"This is the recommended entry point for automatic file type detection and partitioning.","symbol":"partition","correct":"from 
unstructured.partition.auto import partition"},{"note":"Use this directly if the file type is known to avoid filetype detection overhead and dependencies.","symbol":"partition_pdf","correct":"from unstructured.partition.pdf import partition_pdf"},{"note":"Commonly used to convert the list of elements into a JSON output.","symbol":"elements_to_json","correct":"from unstructured.staging.base import elements_to_json"}],"quickstart":{"code":"import os\nfrom unstructured.partition.auto import partition\nfrom unstructured.staging.base import elements_to_json\n\n# Create a dummy PDF file for demonstration\ndummy_pdf_content = b\"%PDF-1.4\\n1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj 3 0 obj<</Type/Page/Parent 2 0 R/MediaBox[0 0 612 792]/Contents 4 0 R/Resources<<>>>>endobj 4 0 obj<</Length 44>>stream\\nBT\\n/F1 24 Tf\\n100 700 Td\\n(Hello, Unstructured!) Tj\\nET\\nendstream\\nendobj\\nxref\\n0 5\\n0000000000 65535 f\\n0000000009 00000 n\\n0000000074 00000 n\\n0000000130 00000 n\\n0000000302 00000 n\\ntrailer<</Size 5/Root 1 0 R>>startxref\\n390\\n%%EOF\"\n\nwith open(\"dummy.pdf\", \"wb\") as f:\n    f.write(dummy_pdf_content)\n\n# Partition the PDF document\nprint(\"Partitioning 'dummy.pdf'...\")\nelements = partition(filename=\"dummy.pdf\")\n\n# Print the extracted elements\nprint(\"\\n--- Extracted Elements ---\")\nfor element in elements:\n    print(f\"Type: {type(element).__name__}, Text: {element.text[:75]}...\")\n\n# Convert elements to JSON and print\nprint(\"\\n--- JSON Output ---\")\njson_output = elements_to_json(elements)\nprint(json_output)\n\n# Clean up the dummy file\nos.remove(\"dummy.pdf\")\n","lang":"python","description":"This quickstart demonstrates how to use the `partition` function to process a PDF file and extract its constituent elements. It then converts these elements into a JSON format. 
This requires the `unstructured[pdf]` extra and system dependencies like Poppler and Tesseract for full functionality."},"warnings":[{"fix":"Review `unstructured` dependency usage. Ensure spaCy models are installed as needed if you were leveraging specific NLP capabilities implicitly through older Unstructured versions.","message":"Version 0.21.0 replaced NLTK with spaCy to remediate CVE-2025-14009. If your project relied on NLTK components used by Unstructured, you might need to update your dependencies or code.","severity":"breaking","affected_versions":">=0.21.0"},{"fix":"Install the recommended system dependencies for the document types you intend to process. Refer to the 'Full Installation' guide in the official documentation for OS-specific instructions.","message":"Full functionality for various document types (e.g., PDFs, images, Office docs) requires installing additional system dependencies (e.g., `libmagic-dev`, `poppler-utils`, `tesseract-ocr`, `libreoffice`, `pandoc`). Without these, some document processing will fail or be limited.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Evaluate your use case: for serious production deployments, consider the Unstructured API or UI. For local prototyping or smaller-scale automation, the open-source library is suitable.","message":"The open-source library is primarily for prototyping. For production-grade scenarios, Unstructured-IO recommends using their hosted UI or API, which offers greater scalability, robustness, and more advanced features.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Pass `unique_element_ids=True` to the `partition` function (e.g., `partition(filename, unique_element_ids=True)`) to generate UUIDs for element IDs, ensuring uniqueness.","message":"Element IDs are SHA-256 hashes by default and are not guaranteed to be unique across different elements with identical text content. 
This can lead to collisions if used as primary keys.","severity":"gotcha","affected_versions":"All versions"},{"fix":"To keep telemetry enabled, leave the variable unset or set it to an empty string. To opt out, set it to any non-empty value (e.g., `export DO_NOT_TRACK=1`).","message":"The semantics of the telemetry (analytics) opt-out environment variables changed. `DO_NOT_TRACK` and `SCARF_NO_ANALYTICS` now treat any non-empty string value, including 'false', '0', and 'no', as an opt-out. Previously, only the exact string 'true' opted out.","severity":"breaking","affected_versions":"Recent versions, starting around 0.22.x"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}