Unstructured Ingest

1.4.24 · active · verified Thu Apr 16

Unstructured Ingest is a Python library for running local ETL pipelines that prepare unstructured data (e.g., PDFs, HTML, Word documents) for retrieval-augmented generation (RAG) and other AI/LLM applications. It provides source and destination connectors and supports batch processing, partitioning, chunking, and embedding of documents. The current version is 1.4.24; the library is under active development, with frequent releases and new connector integrations.
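
To make the chunking stage concrete, here is a minimal pure-Python sketch of the idea behind title-based chunking: start a new chunk at every title element and also when a chunk would exceed a size limit. The `Element` class, the `chunk_by_title` function, and the `max_characters` parameter are hypothetical names used only for illustration; this is not the library's implementation, which handles more cases.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a partitioned document element."""
    type: str  # e.g. "Title" or "NarrativeText"
    text: str


def chunk_by_title(elements, max_characters=500):
    """Group element texts into chunks, splitting at titles and at a size limit."""
    chunks = []
    current = []
    for el in elements:
        # Length of the chunk if this element were appended to it.
        candidate_len = len("\n".join(current + [el.text]))
        # Close the open chunk when a new section starts or the limit would be exceeded.
        if current and (el.type == "Title" or candidate_len > max_characters):
            chunks.append("\n".join(current))
            current = []
        current.append(el.text)
    if current:
        chunks.append("\n".join(current))
    return chunks


elements = [
    Element("Title", "Intro"),
    Element("NarrativeText", "First paragraph."),
    Element("Title", "Methods"),
    Element("NarrativeText", "Second paragraph."),
    Element("NarrativeText", "Third paragraph."),
]
chunks = chunk_by_title(elements)
```

With the default limit this yields one chunk per section. A single element longer than `max_characters` is kept whole in this sketch rather than split.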

Quickstart

This quickstart sets up a basic local-to-local ingestion pipeline with `unstructured-ingest`. It indexes a local text file, partitions it, chunks it by title, and writes the processed output to another local directory. The pipeline is configured programmatically, and commented-out lines mark where to enable API-based partitioning and an embedding provider.

import os
import tempfile
from unstructured_ingest.pipeline.pipeline import Pipeline
from unstructured_ingest.interfaces import ProcessorConfig
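from unstructured_ingest.processes.connectors.local import (
    LocalIndexerConfig,
    LocalConnectionConfig,
    LocalDownloaderConfig,
    LocalUploaderConfig,
)
from unstructured_ingest.processes.partitioner import PartitionerConfig
from unstructured_ingest.processes.chunker import ChunkerConfig
from unstructured_ingest.processes.embedder import EmbedderConfig  # only needed if embedding is enabled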

# Create a dummy input file and directory
input_dir = tempfile.mkdtemp()
output_dir = tempfile.mkdtemp()
with open(os.path.join(input_dir, "example.txt"), "w") as f:
    f.write("This is a test document.\nIt has multiple lines.")

print(f"Processing files from: {input_dir}")
print(f"Output will be saved to: {output_dir}")

# To partition via the Unstructured API instead of locally, set UNSTRUCTURED_API_KEY and
# UNSTRUCTURED_API_URL in your environment before running this script; the commented-out
# partitioner options below read them with os.environ.get().

pipeline = Pipeline.from_configs(
    # General run settings; the output location is set on the uploader below.
    context=ProcessorConfig(
        verbose=True,
        num_processes=2,
    ),
    indexer_config=LocalIndexerConfig(input_path=input_dir),
    downloader_config=LocalDownloaderConfig(),
    source_connection_config=LocalConnectionConfig(),
    partitioner_config=PartitionerConfig(
        strategy="auto",
        # partition_by_api=True, # Uncomment to use Unstructured API
        # api_key=os.environ.get("UNSTRUCTURED_API_KEY"),
        # partition_endpoint=os.environ.get("UNSTRUCTURED_API_URL"),
    ),
    chunker_config=ChunkerConfig(chunking_strategy="by_title"),
    # embedder_config=EmbedderConfig(embedding_provider="huggingface"),  # Requires "unstructured-ingest[embed-huggingface]"
    uploader_config=LocalUploaderConfig(output_dir=output_dir)
)

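# Note: with num_processes > 1 the pipeline may use multiprocessing. On platforms that spawn
# worker processes (Windows, macOS), run this script with the pipeline setup and run() call
# placed under an `if __name__ == "__main__":` guard.
pipeline.run()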
print("Ingestion complete. Check output directory for processed files.")

# Clean up (optional)
# import shutil
# shutil.rmtree(input_dir)
# shutil.rmtree(output_dir)
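
After the run, it can help to check what was written. The helper below is a minimal sketch, assuming the output directory contains JSON files that each hold an array of element objects with a `"type"` key; `summarize_output` is a hypothetical name, not part of `unstructured-ingest`.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_output(output_dir):
    """Return {file name: Counter of element types} for JSON files under output_dir."""
    summary = {}
    for path in sorted(Path(output_dir).rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            elements = json.load(f)
        # Assumption: each file is a JSON array of element dicts; skip anything else.
        if not isinstance(elements, list):
            continue
        summary[path.name] = Counter(
            el.get("type", "unknown") for el in elements if isinstance(el, dict)
        )
    return summary
```

Calling `summarize_output(output_dir)` after `pipeline.run()` and printing each entry gives a quick view of how many elements of each type every input file produced.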
