{"id":7837,"library":"unstructured-ingest","title":"Unstructured Ingest","description":"Unstructured Ingest is a Python library that provides local ETL data pipelines to prepare diverse unstructured data (e.g., PDFs, HTML, Word docs) for RAG (Retrieval Augmented Generation) and other AI/LLM applications. It supports various source and destination connectors, enabling batch processing, partitioning, chunking, and embedding of documents. The current version is 1.4.24, and it sees frequent updates with ongoing development and new connector integrations.","status":"active","version":"1.4.24","language":"en","source_language":"en","source_url":"https://github.com/Unstructured-IO/unstructured-ingest","tags":["ETL","RAG","LLM","data-pipeline","unstructured-data","document-processing","connectors"],"install":[{"cmd":"pip install unstructured-ingest","lang":"bash","label":"Base Installation"},{"cmd":"pip install \"unstructured-ingest[pdf,s3]\"","lang":"bash","label":"Installation with optional dependencies (e.g., PDF, S3)"}],"dependencies":[{"reason":"Core library for partitioning and processing documents.","package":"unstructured","optional":false},{"reason":"Required for local file system operations.","package":"fsspec","optional":false},{"reason":"Specific connectors (e.g., 'pdf', 's3', 'huggingface') require their own extra dependencies for functionality.","package":"various_extras","optional":true}],"imports":[{"note":"As of v0.7.0, the 'v2' calling pattern moved up one level in the package, making the direct import the correct one.","wrong":"from unstructured_ingest.v2.pipeline.pipeline import Pipeline","symbol":"Pipeline","correct":"from unstructured_ingest.pipeline.pipeline import Pipeline"},{"note":"The 'v2' namespace was removed from the import path in v0.7.0.","wrong":"from unstructured_ingest.v2.interfaces import ProcessorConfig","symbol":"ProcessorConfig","correct":"from unstructured_ingest.interfaces import ProcessorConfig"},{"note":"The 'v2' namespace was 
removed in v0.7.0; the rest of the path is unchanged, so connector classes still live under `processes.connectors`.","wrong":"from unstructured_ingest.v2.processes.connectors.local import LocalIndexerConfig","symbol":"LocalIndexerConfig","correct":"from unstructured_ingest.processes.connectors.local import LocalIndexerConfig"}],"quickstart":{"code":"import os\nimport tempfile\nfrom unstructured_ingest.pipeline.pipeline import Pipeline\nfrom unstructured_ingest.interfaces import ProcessorConfig\nfrom unstructured_ingest.processes.connectors.local import (\n    LocalIndexerConfig,\n    LocalConnectionConfig,\n    LocalDownloaderConfig,\n    LocalUploaderConfig\n)\nfrom unstructured_ingest.processes.partitioner import PartitionerConfig\nfrom unstructured_ingest.processes.chunker import ChunkerConfig\nfrom unstructured_ingest.processes.embedder import EmbedderConfig\n\n# Create temporary input and output directories and a sample input file\ninput_dir = tempfile.mkdtemp()\noutput_dir = tempfile.mkdtemp()\nwith open(os.path.join(input_dir, \"example.txt\"), \"w\") as f:\n    f.write(\"This is a test document.\\nIt has multiple lines.\")\n\nprint(f\"Processing files from: {input_dir}\")\nprint(f\"Output will be saved to: {output_dir}\")\n\n# Set UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL if using Unstructured API\n# os.environ[\"UNSTRUCTURED_API_KEY\"] = os.environ.get(\"UNSTRUCTURED_API_KEY\", \"\")\n# os.environ[\"UNSTRUCTURED_API_URL\"] = os.environ.get(\"UNSTRUCTURED_API_URL\", \"\")\n\npipeline = Pipeline.from_configs(\n    context=ProcessorConfig(\n        output_dir=output_dir,\n        verbose=True,\n        num_processes=2\n    ),\n    indexer_config=LocalIndexerConfig(input_path=input_dir),\n    downloader_config=LocalDownloaderConfig(),\n    source_connection_config=LocalConnectionConfig(),\n    partitioner_config=PartitionerConfig(\n        strategy=\"auto\",\n        # partition_by_api=True, # Uncomment to use Unstructured API\n        # api_key=os.environ.get(\"UNSTRUCTURED_API_KEY\"),\n        # partition_endpoint=os.environ.get(\"UNSTRUCTURED_API_URL\"),\n    ),\n    
chunker_config=ChunkerConfig(chunking_strategy=\"by_title\"),\n    # embedder_config=EmbedderConfig(embedding_provider=\"huggingface\"), # Requires \"unstructured-ingest[embed-huggingface]\"\n    uploader_config=LocalUploaderConfig(output_dir=output_dir)\n)\n\npipeline.run()\nprint(\"Ingestion complete. Check output directory for processed files.\")\n\n# Clean up (optional)\n# import shutil\n# shutil.rmtree(input_dir)\n# shutil.rmtree(output_dir)","lang":"python","description":"This quickstart sets up a basic local-to-local ingestion pipeline with `unstructured-ingest`. It indexes a local text file, partitions it, chunks it by title, and writes the processed output to another local directory. The pipeline is configured programmatically, and the commented-out lines mark where to enable API-based partitioning and an embedding provider."},"warnings":[{"fix":"Remove `.v2` from all `from unstructured_ingest.v2... import ...` statements. For example, `from unstructured_ingest.v2.pipeline.pipeline import Pipeline` becomes `from unstructured_ingest.pipeline.pipeline import Pipeline`.","message":"Version 0.7.0 introduced breaking changes: the 'v2' calling pattern moved up one level in the package, and the 'v1' pattern was removed. Existing code using `from unstructured_ingest.v2...` will break.","severity":"breaking","affected_versions":">=0.7.0"},{"fix":"Check the documentation for the specific source, destination, file type, or processing step you are using and install the recommended extra dependencies (e.g., `pip install \"unstructured-ingest[pdf,s3,embed-huggingface]\"`).","message":"Many features (e.g., file types such as PDF, connectors such as S3, and embedding providers) require additional 'extra' dependencies, installed with `pip install \"unstructured-ingest[extra_name]\"`. 
Forgetting these will lead to runtime errors when trying to process unsupported types or connect to services.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Set `os.environ[\"UNSTRUCTURED_API_KEY\"]` and `os.environ[\"UNSTRUCTURED_API_URL\"]` or pass `api_key` and `partition_endpoint` arguments to `PartitionerConfig` (and `ChunkerConfig` or `EmbedderConfig` if applicable) when `partition_by_api=True`.","message":"If using the Unstructured API for partitioning, chunking, or embedding, you must provide valid `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL` (or `partition_endpoint`) environment variables or parameters. Local processing does not require these.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Install the necessary extra dependencies for the file type. For example, for PDFs, run `pip install \"unstructured-ingest[pdf]\"`. For general troubleshooting, ensure `libmagic-dev`, `poppler-utils`, and `tesseract-ocr` are installed as system dependencies.","cause":"The input file type is not recognized or is not supported by the currently installed dependencies. Often occurs with PDFs, DOCX, or other complex formats without their respective 'extra' dependencies.","error":"unstructured_ingest.error.PartitionError: Error in partitioning content: Invalid file /path/to/file. The FileType.UNK file type is not supported in partition."},{"fix":"Update your import statements. Remove `.v2` from the import path. 
For example, change `from unstructured_ingest.v2.pipeline.pipeline import Pipeline` to `from unstructured_ingest.pipeline.pipeline import Pipeline`.","cause":"This error occurs in versions 0.7.0 and later if you are using the old 'v2' import path which was removed in a breaking change.","error":"ModuleNotFoundError: No module named 'unstructured_ingest.v2.pipeline'"},{"fix":"Ensure `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL` (or `partition_endpoint`) environment variables are correctly set and contain valid credentials from your Unstructured account. Double-check that `partition_by_api=True` is correctly configured if you intend to use the API.","cause":"This typically means an invalid or missing API key and/or API URL when attempting to use the Unstructured API for processing.","error":"requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://api.unstructuredapp.io/general/v0/general/partition"}]}