{"id":3503,"library":"google-cloud-documentai","title":"Google Cloud Document AI","description":"Google Cloud Document AI is a service for parsing structured information from unstructured or semi-structured documents using state-of-the-art Google AI, including natural language processing, computer vision, translation, and AutoML. It helps automate tedious tasks, improve data extraction, and gain deeper insights from documents. The Python client library, currently at version 3.14.0, is part of the actively maintained `google-cloud-python` monorepo, which receives frequent updates.","status":"active","version":"3.14.0","language":"en","source_language":"en","source_url":"https://github.com/googleapis/google-cloud-python/tree/main/packages/google-cloud-documentai","tags":["google cloud","document processing","ocr","ai","machine learning","enterprise ai"],"install":[{"cmd":"pip install google-cloud-documentai","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Requires Python >= 3.9.","package":"Python","optional":false},{"reason":"Required for processing documents from or writing results to Google Cloud Storage buckets.","package":"google-cloud-storage","optional":true}],"imports":[{"symbol":"documentai","correct":"from google.cloud import documentai"},{"symbol":"ClientOptions","correct":"from google.api_core.client_options import ClientOptions"}],"quickstart":{"code":"import os\nfrom google.cloud import documentai_v1 as documentai\nfrom google.api_core.client_options import ClientOptions\n\nproject_id = os.environ.get('GCP_PROJECT_ID', 'your-project-id')\nlocation = os.environ.get('GCP_REGION', 'us')  # Format is 'us' or 'eu'\nprocessor_id = os.environ.get('DOCUMENT_AI_PROCESSOR_ID', 'your-processor-id')\nprocessor_version_id = os.environ.get('DOCUMENT_AI_PROCESSOR_VERSION_ID', 'rc')  # Or a specific version, e.g., 'pretrained-ocr-v1.0-2020-09-23'\n\n# The full resource name of the processor version\n# You can also use 
just 'projects/project_id/locations/location/processors/processor_id'\nprocessor_name = f\"projects/{project_id}/locations/{location}/processors/{processor_id}/processorVersions/{processor_version_id}\"\n\n# Minimal in-memory PDF used as placeholder input.\n# For a real application, read the document bytes from a file instead.\ndummy_pdf_content = b\"%PDF-1.4\\n1 0 obj <</Type/Catalog/Pages 2 0 R>> endobj 2 0 obj <</Type/Pages/Count 1/Kids[3 0 R]>> endobj 3 0 obj <</Type/Page/MediaBox[0 0 612 792]/Contents 4 0 R/Parent 2 0 R>> endobj 4 0 obj <</Length 100>> stream\\nBT /F1 24 Tf 100 700 Td (Hello Document AI!) Tj ET\\nendstream\\nendobj\\nxref\\n0 5\\n0000000000 65535 f\\n0000000009 00000 n\\n0000000074 00000 n\\n0000000155 00000 n\\n0000000207 00000 n\\ntrailer<</Size 5/Root 1 0 R>>\\nstartxref\\n313\\n%%EOF\"\n\nmime_type = \"application/pdf\"\n\n# Configure the client with the regional endpoint\nclient_options = ClientOptions(api_endpoint=f\"{location}-documentai.googleapis.com\")\nclient = documentai.DocumentProcessorServiceClient(client_options=client_options)\n\n# Wrap the document bytes and MIME type in a RawDocument\nraw_document = documentai.RawDocument(content=dummy_pdf_content, mime_type=mime_type)\n\n# Build the request for the synchronous 'process_document' API\n# (optional settings can be supplied through ProcessRequest.process_options)\nrequest = documentai.ProcessRequest(name=processor_name, raw_document=raw_document)\n\n# You must enable the Document AI API in your Google Cloud project before running this code.\ntry:\n    result = client.process_document(request=request)\n    document = result.document\n    print(f\"Document processing complete. 
Text: {document.text}\")\n    if document.pages:\n        print(f\"Number of pages: {len(document.pages)}\")\nexcept Exception as e:\n    print(f\"Error processing document: {e}\")\n    print(\"Ensure the GOOGLE_APPLICATION_CREDENTIALS environment variable is set or another auth method is configured.\")\n    print(\"Also, verify project_id, location, and processor_id are correct and the API is enabled.\")\n\n","lang":"python","description":"This quickstart demonstrates how to process a raw PDF document using a Document AI processor. It requires setting up authentication, a Google Cloud project, and an enabled Document AI processor. Ensure the `GOOGLE_APPLICATION_CREDENTIALS` environment variable points to your service account key file, or that Application Default Credentials are configured."},"warnings":[{"fix":"Migrate to a Google Cloud certified partner solution for human review and correction, or implement custom human review workflows.","message":"Document AI Human-in-the-Loop (HITL) functionality is deprecated and will no longer be available after January 16, 2025. New customers cannot use it. Existing users must find alternative solutions for human review workflows.","severity":"deprecated","affected_versions":"<= 3.x (before January 16, 2025)"},{"fix":"Regularly check the Document AI release notes for processor version deprecations and plan migrations to newer, supported versions (e.g., `pretrained-foundation-model-v1.5-2025-05-05`).","message":"Document AI processor versions have lifecycles. For example, the Custom Extractor version `pretrained-foundation-model-v1.4-2025-02-05` will no longer be accessible after February 5, 2026. Failing to migrate to a newer processor version can lead to service disruptions.","severity":"breaking","affected_versions":"All versions using specific processor versions"},{"fix":"Upgrade to Python 3.10 or a newer actively supported version.","message":"Python 3.9 reached its community End-of-Life (EOL) in October 2025. 
While `google-cloud-documentai` currently supports Python >= 3.9, other libraries in the `google-cloud-python` ecosystem are beginning to drop support for 3.9. It is recommended to use actively supported Python versions (3.10+) for new development and to plan upgrades for existing systems to ensure continued support and security patches.","severity":"gotcha","affected_versions":"< 3.10"},{"fix":"Improve input document quality, preprocess images to enhance contrast/sharpness, fine-tune custom extractor models with more specific training data, or implement post-processing logic to correct common confusions.","message":"Document AI's OCR may confuse the digit '0' (zero) with the uppercase letter 'O' in extracted data, especially in mixed alphanumeric fields or from low-quality documents.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Carefully review labeled documents for overlapping bounding boxes, ensure all labeled fields contain OCRable text, and systematically test training data. Deleting the latest revision of a faulty document from the dataset can sometimes resolve issues but may lead to data loss.","message":"During custom processor training, issues like intersecting bounding boxes or empty fields with labels can cause 'internal error' messages, which are often unspecific and difficult to diagnose.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}