LangChain Unstructured Integration

1.0.1 · active · verified Thu Apr 16

langchain-unstructured is an integration package that connects the LangChain framework with Unstructured, a library for parsing and processing unstructured documents. It provides document loaders that extract text and metadata from various file types (PDFs, images, HTML, etc.) for use in LangChain applications. The current version is 1.0.1; the package recently moved to the 1.x line and has received minor updates since.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart uses `UnstructuredLoader` to load text from a local file. For more complex file types such as PDFs or images, ensure the necessary `unstructured` extra dependencies are installed (e.g., `pip install "unstructured[pdf]"`). For API-based partitioning, ensure the `UNSTRUCTURED_API_KEY` environment variable is set.

import os
from langchain_unstructured import UnstructuredLoader

# Create a dummy file for demonstration
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("This is a test document.\n")
    f.write("It contains some sample text to be loaded.")

# Instantiate the loader for a local file
loader = UnstructuredLoader("example.txt")

# Load the document(s); the loader may return one Document per parsed element
docs = loader.load()

# Print the content of the first loaded document
if docs:
    print(f"Loaded {len(docs)} document(s).")
    print(f"Page content: {docs[0].page_content[:50]}...")
    print(f"Metadata: {docs[0].metadata}")

# Clean up the dummy file
os.remove("example.txt")
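The prerequisites mentioned above (optional parsing dependencies and the `UNSTRUCTURED_API_KEY` variable) can be checked before a loader is constructed. The following is a minimal sketch using only the standard library. The helper names `module_available`, `loader_kwargs`, and `preflight` are hypothetical, and the keyword names `api_key` and `partition_via_api` are assumptions that should be verified against the loader class in your installed version.

```python
import importlib.util
import os
from typing import Mapping, Optional


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def loader_kwargs(file_path: str, env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for a loader (hypothetical helper).

    Assumption: the loader accepts `file_path`, `api_key`, and
    `partition_via_api`. Check this against your installed version.
    """
    env = os.environ if env is None else env
    kwargs = {"file_path": file_path}
    api_key = env.get("UNSTRUCTURED_API_KEY")
    if api_key:
        # A key is present, so request API-based partitioning.
        kwargs["api_key"] = api_key
        kwargs["partition_via_api"] = True
    return kwargs


def preflight(file_path: str, env: Optional[Mapping[str, str]] = None) -> list:
    """Return a list of human-readable problems; empty means nothing obvious is missing."""
    env = os.environ if env is None else env
    problems = []
    if not os.path.isfile(file_path):
        problems.append(f"file not found: {file_path}")
    if not module_available("langchain_unstructured"):
        problems.append("langchain-unstructured is not installed")
    if not env.get("UNSTRUCTURED_API_KEY") and not module_available("unstructured"):
        problems.append(
            "no UNSTRUCTURED_API_KEY set and `unstructured` is not installed for local parsing"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight("example.txt"):
        print(f"warning: {problem}")
    print(loader_kwargs("example.txt"))
```

Running `preflight` first turns a missing dependency or key into a readable message instead of an import or authentication error at load time. The dictionary from `loader_kwargs` could then be unpacked into the loader constructor, provided the assumed keyword names match.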
