python-documentcloud
python-documentcloud is a simple Python wrapper for the DocumentCloud API (current version 4.5.0). It provides convenient methods to retrieve and edit documents and projects, both public and private, on documentcloud.org. Users can upload PDFs to their DocumentCloud account, organize them into projects, and download extracted text and images. MuckRock actively maintains the library, with releases roughly every one to three months.
Common errors
-
ImportError: cannot import name 'DocumentCloud' from 'documentcloud'
cause This typically occurs because the deprecated `documentcloud` PyPI package was installed instead of the correct `python-documentcloud` package, or because of a mix-up in import paths.
fix First, uninstall the incorrect package: `pip uninstall documentcloud`. Then, install the correct one: `pip install python-documentcloud`. Ensure your import statement is `from documentcloud import DocumentCloud`.
-
documentcloud.exceptions.CredentialsFailedError: Unable to obtain an access token due to bad login credentials
cause The username or password provided to the `DocumentCloud` client constructor (or via environment variables) is incorrect or lacks the necessary permissions.
fix Verify that your `DC_USERNAME` and `DC_PASSWORD` environment variables are correctly set, or that the credentials passed directly to `DocumentCloud()` are accurate for a valid DocumentCloud account with API access.
-
documentcloud.exceptions.MultipleObjectsReturnedError: The API returned multiple objects when it expected one
cause You used a method or query that expects a single, unique result (e.g., `client.documents.get(id)`), but multiple items matched the criteria or the identifier was not specific enough.
fix Ensure that the identifier used is truly unique (e.g., a DocumentCloud numerical ID). If searching, use `client.documents.search()`, which is designed to return multiple results, and then iterate over the results.
-
documentcloud.exceptions.DoesNotExistError
cause The code attempted to access a document, project, or other resource that either does not exist or that the authenticated user does not have permission to view.
fix Double-check the ID or slug of the resource you are trying to access. Confirm that your DocumentCloud account has the necessary permissions to view or modify that specific resource.
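One way to handle the `DoesNotExistError` case above is to treat a missing or invisible resource as "no result" instead of letting the exception propagate. The sketch below is a minimal illustration, not part of the library: `get_document_or_none` is a hypothetical helper name, and it relies only on `client.documents.get()` and `documentcloud.exceptions.DoesNotExistError` as named in this section.

```python
def get_document_or_none(client, doc_id):
    """Return the document with this ID, or None if it is missing or not visible."""
    # Imported inside the function so this sketch can be defined even where
    # python-documentcloud is not installed; a top-level import is more usual.
    from documentcloud.exceptions import DoesNotExistError

    try:
        return client.documents.get(doc_id)
    except DoesNotExistError:
        # Either the ID is wrong or the current account cannot see the resource.
        return None
```

A caller could then write `doc = get_document_or_none(client, some_id)` and branch on `doc is None`. Credential failures are deliberately not caught here, since they usually indicate a configuration problem that should surface immediately.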
Warnings
- breaking Python 2 support was dropped starting with version 4.0.0. Earlier versions (3.x and below) supported Python 2 and 3.
- breaking The API pagination mechanism changed from page number-based to cursor-based in version 3.0.0. This means the `__len__` method is no longer implemented for `APIResults`, and you cannot randomly access pages by number. Iteration is the primary method for processing results.
- gotcha The PyPI package `documentcloud` (without the 'python-' prefix) is deprecated and refers to an older, unmaintained version of the library. Installing this package will lead to outdated functionality and potential compatibility issues.
- gotcha When uploading a new document, its status will initially be 'pending' or 'private' even if marked 'public', due to server-side processing. Attempts to interact with full metadata or public status immediately after upload may show stale data.
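The stale-data gotcha above can be worked around by re-fetching the document until processing has finished. The following is a minimal sketch under stated assumptions: `wait_while_pending` is a hypothetical helper, the interval and timeout defaults are arbitrary, and the only status value it relies on is `'pending'` as mentioned above.

```python
import time


def wait_while_pending(client, doc_id, interval=5.0, timeout=300.0, sleep=time.sleep):
    """Re-fetch a document until its status is no longer 'pending' or the timeout passes."""
    deadline = time.monotonic() + timeout
    doc = client.documents.get(doc_id)
    while doc.status == "pending" and time.monotonic() < deadline:
        # Pause between requests so the API is not polled in a tight loop.
        sleep(interval)
        doc = client.documents.get(doc_id)
    # The caller should still inspect doc.status: the loop also exits on timeout.
    return doc
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` is enough.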
Install
-
pip install python-documentcloud
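Because the deprecated `documentcloud` distribution and `python-documentcloud` both provide an importable `documentcloud` module, it can help to check which distribution is actually installed. This sketch uses only the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Ideally only python-documentcloud reports a version here.
for name in ("python-documentcloud", "documentcloud"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```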
Imports
- DocumentCloud
import documentcloud
from documentcloud import DocumentCloud
- APIError
from documentcloud.exceptions import APIError
Quickstart
import os
from itertools import islice

from documentcloud import DocumentCloud

# Authenticate using environment variables for security
USERNAME = os.environ.get('DC_USERNAME', '')
PASSWORD = os.environ.get('DC_PASSWORD', '')

try:
    # Initialize the client. For private documents/actions, provide credentials.
    # For public documents, no credentials are required.
    client = DocumentCloud(USERNAME, PASSWORD)

    # Search for documents
    query = 'MuckRock'
    print(f"Searching for documents with query: '{query}'")
    results = client.documents.search(query)

    # Results are cursor-paginated and do not implement __len__ (see Warnings),
    # so take a bounded slice by iterating instead of calling len() on them.
    documents = list(islice(results, 10))

    if documents:
        print(f"Showing the first {len(documents)} documents:")
        for doc in documents:
            print(f"  - ID: {doc.id}, Title: {doc.title}, Status: {doc.status}")

        # Access a specific document by ID (here, the first search result)
        first_doc_id = documents[0].id
        doc = client.documents.get(first_doc_id)
        print(f"\nRetrieved document ID {doc.id}: '{doc.title}'")
        print(f"  Source: {doc.source}")
    else:
        print("No documents found for the given query.")
except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure DC_USERNAME and DC_PASSWORD environment variables are set if accessing private data.")
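The Quickstart comments note that credentials are only needed for private documents and actions. One possible way to make that explicit is to build the constructor arguments from the environment and fall back to an anonymous client when either variable is unset. This is a sketch, not the library's own mechanism: `credentials_from_env` is a hypothetical helper, and it assumes the constructor accepts `username` and `password` as keyword arguments.

```python
import os


def credentials_from_env(env=None):
    """Build keyword arguments for DocumentCloud() from DC_USERNAME / DC_PASSWORD."""
    env = os.environ if env is None else env
    username = env.get("DC_USERNAME")
    password = env.get("DC_PASSWORD")
    if username and password:
        return {"username": username, "password": password}
    # Missing or empty values: return no arguments, i.e. an anonymous client.
    return {}


# Intended use (requires python-documentcloud to be installed):
#   client = DocumentCloud(**credentials_from_env())
```

Returning an empty dict instead of empty strings keeps the anonymous case distinct from a failed login, which would otherwise surface as `CredentialsFailedError`.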