Google Cloud Document AI
Google Cloud Document AI is a service that extracts structured information from unstructured or semi-structured documents using Google AI, including natural language processing, computer vision, translation, and AutoML. It helps automate tedious document-handling tasks, improve data extraction, and surface insights from documents. The Python client library, `google-cloud-documentai` (version 3.14.0 at the time of writing), is maintained in the `google-cloud-python` monorepo and receives frequent updates.
Warnings
- deprecated Document AI Human-in-the-Loop (HITL) functionality is deprecated and will no longer be available after January 16, 2025. New customers cannot use it. Existing users must find alternative solutions for human review workflows.
- breaking Document AI processor versions have lifecycles. For example, the Custom Extractor version `pretrained-foundation-model-v1.4-2025-02-05` will no longer be accessible after February 5, 2026. Failing to migrate to a newer processor version can lead to service disruptions.
- gotcha Python 3.9 is reaching its community End-of-Life (EOL) in October 2025. While `google-cloud-documentai` currently supports Python >= 3.9, other libraries in the `google-cloud-python` ecosystem are beginning to drop support for 3.9. It is recommended to use actively supported Python versions (3.10+) for new development and to plan upgrades for existing systems to ensure continued support and security patches.
- gotcha Document AI's OCR may confuse the digit '0' (zero) with the uppercase letter 'O' in extracted data, especially in mixed alphanumeric fields or from low-quality documents.
- gotcha During custom processor training, issues like intersecting bounding boxes or empty fields with labels can cause 'internal error' messages, which are often unspecific and difficult to diagnose.
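The '0' versus 'O' confusion noted above can be partly mitigated after extraction. The following is a minimal sketch, not part of the Document AI library: `normalize_numeric_field` is a hypothetical helper, and it assumes you already know which extracted fields are supposed to be numeric (invoice numbers, amounts, dates). It rewrites 'O' to '0' only when the result looks fully numeric, so genuinely alphanumeric values are left alone.

```python
import re


def normalize_numeric_field(value: str) -> str:
    """Replace uppercase 'O' with '0' if the field is otherwise numeric.

    Hypothetical post-processing helper; the accepted separator set below
    is an assumption and should be adjusted to your own field formats.
    """
    candidate = value.replace("O", "0")
    # Accept digits plus a few common separators; anything else means the
    # field is probably alphanumeric, so return the original unchanged.
    if re.fullmatch(r"[0-9][0-9 .,/-]*", candidate):
        return candidate
    return value


print(normalize_numeric_field("1O0-2O5"))   # numeric-looking field, gets corrected
print(normalize_numeric_field("ORDER-10"))  # alphanumeric field, left unchanged
```

Applying the correction per field, rather than to the whole document text, avoids corrupting ordinary words that contain the letter 'O'.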
Install
-
pip install google-cloud-documentai
Imports
- documentai
from google.cloud import documentai
- ClientOptions
from google.api_core.client_options import ClientOptions
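`ClientOptions` is typically used to point the client at a regional endpoint, and the request then needs a processor resource name. The sketch below builds both strings in the formats used in this document. The helper names `regional_endpoint` and `processor_resource_name` are hypothetical and not part of the library; the functions only format strings and make no API calls.

```python
from typing import Optional


def regional_endpoint(location: str) -> str:
    """Return the regional API endpoint host, e.g. for 'us' or 'eu'."""
    return f"{location}-documentai.googleapis.com"


def processor_resource_name(
    project_id: str,
    location: str,
    processor_id: str,
    processor_version_id: Optional[str] = None,
) -> str:
    """Build a processor resource name, optionally pinned to a version."""
    name = f"projects/{project_id}/locations/{location}/processors/{processor_id}"
    if processor_version_id:
        # Appending a version targets that specific processor version.
        name += f"/processorVersions/{processor_version_id}"
    return name


print(regional_endpoint("eu"))
print(processor_resource_name("my-project", "eu", "abc123", "rc"))
```

The endpoint string would be passed as `ClientOptions(api_endpoint=...)` and the resource name as the `name` field of the request.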
Quickstart
import os
from google.cloud import documentai_v1 as documentai
from google.api_core.client_options import ClientOptions
project_id = os.environ.get('GCP_PROJECT_ID', 'your-project-id')
location = os.environ.get('GCP_REGION', 'us') # Format is 'us' or 'eu'
processor_id = os.environ.get('DOCUMENT_AI_PROCESSOR_ID', 'your-processor-id')
processor_version_id = os.environ.get('DOCUMENT_AI_PROCESSOR_VERSION_ID', 'rc') # Or specific version, e.g., 'pretrained-ocr-v1.0-2020-09-23'
# Full resource name of the processor version.
# To use the processor's default version instead, pass
# 'projects/{project_id}/locations/{location}/processors/{processor_id}'.
processor_name = f"projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}"
# Minimal inline PDF used as a placeholder document.
# For a real application, read the document bytes from a file instead.
dummy_pdf_content = b"%PDF-1.4\n1 0 obj <</Type/Catalog/Pages 2 0 R>> endobj 2 0 obj <</Type/Pages/Count 1/Kids[3 0 R]>> endobj 3 0 obj <</Type/Page/MediaBox[0 0 612 792]/Contents 4 0 R/Parent 2 0 R>> endobj 4 0 obj <</Length 100>> stream\nBT /F1 24 Tf 100 700 Td (Hello Document AI!) Tj ET\nendstream\nendobj\nxref\n0 5\n0000000000 65535 f\n0000000009 00000 n\n0000000074 00000 n\n0000000155 00000 n\n0000000207 00000 n\ntrailer<</Size 5/Root 1 0 R>>\nstartxref\n313\n%%EOF"
mime_type = "application/pdf"
# Configure the client with regional endpoint
client_options = ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com")
client = documentai.DocumentProcessorServiceClient(client_options=client_options)
# Wrap the document bytes and MIME type in a RawDocument
raw_document = documentai.RawDocument(content=dummy_pdf_content, mime_type=mime_type)
# Build the request; optional fields such as process_options can also be set on ProcessRequest
request = documentai.ProcessRequest(name=processor_name, raw_document=raw_document)
# You must enable the Document AI API in your Google Cloud project before running this code.
try:
    result = client.process_document(request=request)
    document = result.document
    print(f"Document processing complete. Text: {document.text}")
    if document.pages:
        print(f"Number of pages: {len(document.pages)}")
except Exception as e:
    print(f"Error processing document: {e}")
    print("Ensure GOOGLE_APPLICATION_CREDENTIALS environment variable is set or other auth method is configured.")
    print("Also, verify project_id, location, and processor_id are correct and the API is enabled.")