Unstructured Ingest
Unstructured Ingest is a Python library that provides local ETL pipelines for preparing unstructured data (e.g., PDFs, HTML, Word docs) for retrieval-augmented generation (RAG) and other AI/LLM applications. It supports a range of source and destination connectors and handles batch processing, partitioning, chunking, and embedding of documents. The current version is 1.4.24; releases are frequent and regularly add new connectors.
Common errors
- unstructured_ingest.error.PartitionError: Error in partitioning content: Invalid file /path/to/file. The FileType.UNK file type is not supported in partition.
  cause: The input file type is not recognized, or the installed dependencies do not support it. This often happens with PDF, DOCX, and other complex formats when their 'extra' dependencies are missing.
  fix: Install the extra for that file type. For PDFs, run `pip install "unstructured-ingest[pdf]"`. For general troubleshooting, confirm that the system dependencies `libmagic-dev`, `poppler-utils`, and `tesseract-ocr` are installed.
- ModuleNotFoundError: No module named 'unstructured_ingest.v2.pipeline'
  cause: In versions 0.7.0 and later, the old 'v2' import path no longer exists; it was removed in a breaking change.
  fix: Remove `.v2` from the import path. For example, change `from unstructured_ingest.v2.pipeline.pipeline import Pipeline` to `from unstructured_ingest.pipeline.pipeline import Pipeline`.
- requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://api.unstructuredapp.io/general/v0/general/partition
  cause: The API key, the API URL, or both are invalid or missing when the Unstructured API is used for processing.
  fix: Set the `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL` environment variables (or pass `partition_endpoint`) with valid credentials from your Unstructured account. If you intend to use the API, also check that `partition_by_api=True` is set.
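Before debugging a 401, it can help to rule out unset variables. The sketch below is a hypothetical helper (`missing_api_settings` is not part of the library) that only checks whether the two environment variables named above are present and non-blank; it cannot tell whether the values are valid.

```python
import os

# Variable names come from the fix above; the helper itself is hypothetical.
REQUIRED_VARS = ("UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL")


def missing_api_settings(env=None):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not (env.get(name) or "").strip()]


missing = missing_api_settings()
if missing:
    print("Missing API settings: " + ", ".join(missing))
else:
    print("API settings are present; a 401 would then point at invalid values.")
```

If both variables are set and the error persists, the key or URL itself is the more likely culprit.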
Warnings
- breaking: Version 0.7.0 moved the 'v2' calling pattern up one level in the package and removed the deprecated 'v1' pattern. Existing code using `from unstructured_ingest.v2...` will break.
- gotcha: Many features (specific file types such as PDF, connectors such as S3, embedding providers) require 'extra' dependencies installed with `pip install "unstructured-ingest[extra_name]"`. Without them, processing an unsupported type or connecting to a service fails at runtime.
- gotcha: If you use the Unstructured API for partitioning, chunking, or embedding, you must provide a valid `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL` (or `partition_endpoint`), either as environment variables or as parameters. Local processing does not require these.
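For codebases with many old-style imports, the `.v2` removal can be applied mechanically. The following is a minimal sketch of a text-level rewrite; `strip_v2` is a hypothetical helper, and because it works on raw text it will also change matching strings and comments, so review the diff before committing.

```python
import re

# Match "unstructured_ingest.v2" only when followed by ".", whitespace, or end
# of text, so a longer name such as "unstructured_ingest.v2x" is left alone.
_V2_PATH = re.compile(r"\bunstructured_ingest\.v2(?=[.\s]|$)")


def strip_v2(source: str) -> str:
    """Rewrite 'unstructured_ingest.v2...' module paths to the 0.7.0+ form."""
    return _V2_PATH.sub("unstructured_ingest", source)


old_line = "from unstructured_ingest.v2.pipeline.pipeline import Pipeline"
print(strip_v2(old_line))
```

This covers only the rule stated above (dropping `.v2`); modules that were renamed or moved elsewhere would still need manual attention.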
Install
- pip install unstructured-ingest
- pip install "unstructured-ingest[pdf,s3]"
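To decide which extras to request, one option is to derive them from the files you plan to process. This is a rough sketch: `pip_command` and the `EXTRA_FOR_SUFFIX` table are hypothetical, only the `.pdf` to `pdf` pairing and the `s3` extra appear in this document, and the `.docx` entry is an assumed extra name that should be checked against the package metadata.

```python
from pathlib import Path

# ".pdf" -> "pdf" comes from this document; ".docx" -> "docx" is an assumption.
EXTRA_FOR_SUFFIX = {".pdf": "pdf", ".docx": "docx"}


def pip_command(paths, connector_extras=()):
    """Build a pip command covering the file types in `paths` plus connector extras."""
    suffixes = (Path(p).suffix.lower() for p in paths)
    extras = {EXTRA_FOR_SUFFIX[s] for s in suffixes if s in EXTRA_FOR_SUFFIX}
    extras.update(connector_extras)
    if not extras:
        return "pip install unstructured-ingest"
    return 'pip install "unstructured-ingest[{}]"'.format(",".join(sorted(extras)))


print(pip_command(["report.pdf", "notes.txt"], connector_extras=["s3"]))
```

Plain-text inputs add no extra, so a directory of `.txt` files yields the bare install command.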
Imports
- Pipeline
  old (fails on 0.7.0 and later): from unstructured_ingest.v2.pipeline.pipeline import Pipeline
  current: from unstructured_ingest.pipeline.pipeline import Pipeline
- ProcessorConfig
  old (fails on 0.7.0 and later): from unstructured_ingest.v2.interfaces import ProcessorConfig
  current: from unstructured_ingest.interfaces import ProcessorConfig
- LocalIndexerConfig
  old (fails on 0.7.0 and later): from unstructured_ingest.v2.processes.connectors.local import LocalIndexerConfig
  current: from unstructured_ingest.processes.connectors.local import LocalIndexerConfig
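Code that has to run against both pre-0.7.0 and later installs can try the current path first and fall back to the old one. The sketch below uses a hypothetical helper, `import_first`, built on the standard `importlib.import_module`; the module paths in the usage comment are the ones listed above.

```python
import importlib


def import_first(candidates, attr):
    """Return `attr` from the first importable module path in `candidates`."""
    errors = []
    for module_path in candidates:
        try:
            module = importlib.import_module(module_path)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{module_path}: {exc}")
    raise ImportError(f"could not import {attr!r}; tried: " + "; ".join(errors))


# Usage (requires unstructured-ingest to be installed):
# Pipeline = import_first(
#     ["unstructured_ingest.pipeline.pipeline",
#      "unstructured_ingest.v2.pipeline.pipeline"],
#     "Pipeline",
# )
```

Pinning the package to 0.7.0 or later and using only the current paths is simpler when you control the environment; the fallback is mainly useful in shared tooling.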
Quickstart
import os
import tempfile
from unstructured_ingest.pipeline.pipeline import Pipeline
from unstructured_ingest.interfaces import ProcessorConfig
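# Module paths follow the 0.7.0+ layout (the former v2 paths with ".v2" removed)
from unstructured_ingest.processes.connectors.local import (
    LocalIndexerConfig,
    LocalConnectionConfig,
    LocalDownloaderConfig,
    LocalUploaderConfig,
)
from unstructured_ingest.processes.partitioner import PartitionerConfig
from unstructured_ingest.processes.chunker import ChunkerConfig
from unstructured_ingest.processes.embedder import EmbedderConfig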
# Create a dummy input file and directory
input_dir = tempfile.mkdtemp()
output_dir = tempfile.mkdtemp()
with open(os.path.join(input_dir, "example.txt"), "w") as f:
    f.write("This is a test document.\nIt has multiple lines.")
print(f"Processing files from: {input_dir}")
print(f"Output will be saved to: {output_dir}")
# Set UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL if using Unstructured API
# os.environ["UNSTRUCTURED_API_KEY"] = os.environ.get("UNSTRUCTURED_API_KEY", "")
# os.environ["UNSTRUCTURED_API_URL"] = os.environ.get("UNSTRUCTURED_API_URL", "")
pipeline = Pipeline.from_configs(
    context=ProcessorConfig(
        output_dir=output_dir,
        verbose=True,
        num_processes=2,
    ),
    indexer_config=LocalIndexerConfig(input_path=input_dir),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        strategy="auto",
        # partition_by_api=True,  # Uncomment to use Unstructured API
        # api_key=os.environ.get("UNSTRUCTURED_API_KEY"),
        # partition_endpoint=os.environ.get("UNSTRUCTURED_API_URL"),
    ),
    chunker_config=ChunkerConfig(chunking_strategy="by_title"),
    # embedder_config=EmbedderConfig(embedding_provider="huggingface"),  # Requires "unstructured-ingest[huggingface]"
    uploader_config=LocalUploaderConfig(output_dir=output_dir),
)
pipeline.run()
print("Ingestion complete. Check output directory for processed files.")
# Clean up (optional)
# import shutil
# shutil.rmtree(input_dir)
# shutil.rmtree(output_dir)
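To inspect the results, the processed files can be read back from the output directory. This is a sketch under assumptions not stated above: that the local uploader writes one JSON file per input, and that each file holds a list of element dictionaries with keys such as `type` and `text`. `load_elements` is a hypothetical helper, not a library function.

```python
import json
from pathlib import Path


def load_elements(output_dir):
    """Yield (file_name, element) pairs from every JSON file under output_dir."""
    for path in sorted(Path(output_dir).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        # Assumed layout: each file holds a list of element dicts.
        if isinstance(data, list):
            for element in data:
                yield path.name, element


# Usage, after pipeline.run():
# for name, element in load_elements(output_dir):
#     print(name, element.get("type"), (element.get("text") or "")[:60])
```

Files that do not contain a top-level list are skipped rather than raising, which keeps the loop usable if the directory also holds other JSON artifacts.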