LangChain Unstructured Integration
`langchain-unstructured` is an integration package that connects the LangChain framework with Unstructured, a library for parsing and processing unstructured documents. It provides document loaders that extract text and metadata from a range of file types (PDF, images, HTML, and others) for use in LangChain applications. The current version is 1.0.1; the package recently moved to the 1.x line and has had minor updates since.
Common errors
- ModuleNotFoundError: No module named 'langchain_unstructured'
  cause: The `langchain-unstructured` package is not installed.
  fix: Run `pip install langchain-unstructured`.
- ModuleNotFoundError: No module named 'unstructured'
  cause: The core `unstructured` library, a dependency, is missing.
  fix: Run `pip install unstructured`. For specific file types, consider `pip install "unstructured[pdf]"` or other extras.
- ImportError: cannot import name 'UnstructuredFileLoader' from 'langchain.document_loaders'
  cause: `UnstructuredFileLoader` is being imported from the old `langchain` package path.
  fix: Change the import to `from langchain_unstructured.document_loaders import UnstructuredFileLoader`.
- ValueError: Unstructured API key not provided.
  cause: When using `UnstructuredAPIFileLoader`, the `UNSTRUCTURED_API_KEY` environment variable is not set and no API key was passed to the constructor.
  fix: Set the environment variable with `export UNSTRUCTURED_API_KEY='your_api_key'`, or pass the key directly: `UnstructuredAPIFileLoader(..., unstructured_api_key='your_api_key')`.
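The API key error comes down to the key being found neither in the constructor arguments nor in the environment. One way to surface that earlier is to resolve the key before building the loader. The following is a minimal sketch using only the standard library; `resolve_api_key` is a hypothetical helper, not part of `langchain-unstructured`, and only the `UNSTRUCTURED_API_KEY` variable name comes from the text above.

```python
import os
from typing import Optional


def resolve_api_key(explicit: Optional[str] = None,
                    env_var: str = "UNSTRUCTURED_API_KEY") -> str:
    """Return the explicit key if given, otherwise the environment value.

    Hypothetical helper for illustration: it raises ValueError early, with a
    message that names the variable, instead of failing inside the loader.
    """
    key = explicit or os.environ.get(env_var, "")
    if not key.strip():
        raise ValueError(
            f"No API key: pass one explicitly or set {env_var}."
        )
    return key.strip()
```

The returned value could then be passed as `unstructured_api_key=...` when constructing the loader, as shown in the fix above.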
Warnings
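- breaking: The Unstructured document loaders moved from the `langchain` document loaders to the `langchain-unstructured` package; imports must be updated.
- breaking: `langchain-unstructured` v1.0.0 upgrades its `langchain-core` dependency version.
- gotcha: Unstructured requires additional dependencies for specific file types (for example, the `unstructured[pdf]` extra for PDFs).
- gotcha: `UnstructuredAPIFileLoader` requires an API key, supplied through `UNSTRUCTURED_API_KEY` or the `unstructured_api_key` constructor argument.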
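Missing dependencies usually show up only when a loader is first used. A small preflight check can report them at startup instead. This is a sketch under the assumption that checking the two top-level module names from the error messages is enough; `missing_modules` is a hypothetical helper, and it does not verify the file-type extras, which pull in further packages.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Module names taken from the error messages in this document.
    missing = missing_modules(["langchain_unstructured", "unstructured"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("Both modules are importable.")
```

`importlib.util.find_spec` only locates a module without importing it, so the check is cheap and has no side effects from the packages themselves.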
Install
- pip install langchain-unstructured unstructured
- pip install "langchain-unstructured[pdf]" "unstructured[pdf]"
Imports
- UnstructuredFileLoader
  from langchain_unstructured.document_loaders import UnstructuredFileLoader
- UnstructuredAPIFileLoader
  wrong (old path): from langchain.document_loaders import UnstructuredAPIFileLoader
  correct: from langchain_unstructured.document_loaders import UnstructuredAPIFileLoader
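For code that has to run in environments on either side of the migration, one option is to try the new import path first and fall back to the old one. The sketch below is an assumption about how such a shim might look, not something the package provides: `import_first` and `LOADER_CANDIDATES` are hypothetical names, and the module paths are simply the ones listed in this document.

```python
import importlib
from typing import Any, Iterable, Tuple


def import_first(candidates: Iterable[Tuple[str, str]]) -> Any:
    """Return the first attribute importable from (module, name) pairs."""
    errors = []
    for module_name, attr in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{module_name}.{attr}: {exc}")
    raise ImportError(
        "None of the candidates could be imported:\n" + "\n".join(errors)
    )


# Paths as listed in this document: new package first, old path as fallback.
LOADER_CANDIDATES = [
    ("langchain_unstructured.document_loaders", "UnstructuredFileLoader"),
    ("langchain.document_loaders", "UnstructuredFileLoader"),
]
```

A caller would write `UnstructuredFileLoader = import_first(LOADER_CANDIDATES)`. If neither path resolves, the raised `ImportError` lists each failure, which makes it easier to tell a missing package from a renamed class.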
Quickstart
import os

from langchain_unstructured.document_loaders import UnstructuredFileLoader

# Create a dummy file for demonstration
with open("example.txt", "w") as f:
    f.write("This is a test document.\n")
    f.write("It contains some sample text to be loaded.")

# Instantiate the loader for a local file
loader = UnstructuredFileLoader("example.txt")

# Load the document(s)
docs = loader.load()

# Print the content of the first loaded document
if docs:
    print(f"Loaded {len(docs)} document(s).")
    print(f"Page content: {docs[0].page_content[:50]}...")
    print(f"Metadata: {docs[0].metadata}")

# Clean up the dummy file
os.remove("example.txt")
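The Quickstart inspects only the first document. To look over everything a loader returned, a small preview helper can be handy. The sketch below relies only on the two attributes the Quickstart reads, `page_content` and `metadata`; `preview` and the `FakeDocument` stand-in are hypothetical and exist so the example runs without the package installed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List


@dataclass
class FakeDocument:
    """Stand-in with the two attributes the Quickstart reads."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def preview(docs: Iterable[Any], width: int = 50) -> List[str]:
    """Return one line per document: truncated text plus metadata keys."""
    lines = []
    for i, doc in enumerate(docs):
        text = doc.page_content
        # Append an ellipsis only when the text was actually cut.
        snippet = text[:width] + ("..." if len(text) > width else "")
        keys = ", ".join(sorted(doc.metadata))
        lines.append(f"[{i}] {snippet!r} (metadata keys: {keys or 'none'})")
    return lines


if __name__ == "__main__":
    sample = [FakeDocument("This is a test document.", {"source": "example.txt"})]
    for line in preview(sample):
        print(line)
```

With real loader output, `preview(docs)` would be called on the list returned by `loader.load()`.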