{"id":5845,"library":"amazon-textract-textractor","title":"Amazon Textract Textractor","description":"Textractor is a Python package designed to simplify the use of AWS Textract services for document analysis. It provides a higher-level abstraction over the AWS SDK (boto3) to easily extract text, forms, tables, and other data from documents. The library is actively maintained, with frequent minor releases to address bugs and introduce new features.","status":"active","version":"1.9.2","language":"en","source_language":"en","source_url":"https://github.com/aws-samples/amazon-textract-textractor","tags":["aws","textract","ocr","document-processing","pdf","forms","tables"],"install":[{"cmd":"pip install amazon-textract-textractor","lang":"bash","label":"Install core library"},{"cmd":"pip install amazon-textract-textractor[pdf]","lang":"bash","label":"Install with PDF processing support"}],"dependencies":[{"reason":"Required for interacting with AWS Textract and S3 services.","package":"boto3","optional":false},{"reason":"Used for local PDF rasterization when processing local PDF files. Preferred over pdf2image due to fewer external dependencies.","package":"pypdfium2","optional":true},{"reason":"Fallback for local PDF rasterization if pypdfium2 is not available. 
Requires Poppler to be installed on the system.","package":"pdf2image","optional":true}],"imports":[{"note":"The package name on PyPI is `amazon-textract-textractor`, but the top-level importable module is `textractor`.","wrong":"from amazon_textract_textractor import Textractor","symbol":"Textractor","correct":"from textractor import Textractor"},{"note":"The Document class holds the parsed results from Textract.","symbol":"Document","correct":"from textractor.data.document import Document"}],"quickstart":{"code":"import os\nfrom textractor import Textractor\nfrom textractor.data.constants import TextractFeatures\n\n# Ensure AWS credentials are configured (e.g., via AWS CLI, env vars, IAM roles)\n# textractor uses boto3, which automatically picks up credentials.\n# Example: os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_ACCESS_KEY'\n# os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET_KEY'\n# os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'\n\n# Initialize Textractor client\n# Make sure the region matches your S3 bucket and Textract service availability\ntractor = Textractor(region_name=os.environ.get('AWS_DEFAULT_REGION', 'us-east-1'))\n\n# Example: Process a document from an S3 path\ns3_path = \"s3://amazon-textract-public-content/samples/sample.pdf\"\nprint(f\"Processing document: {s3_path}\")\n\n# Starts an asynchronous Textract job; reading results below waits for it to finish\ndocument = tractor.start_document_analysis(\n    file_source=s3_path,\n    features=[TextractFeatures.FORMS, TextractFeatures.TABLES]\n)\n\n# Print extracted forms\nprint(\"\\n--- Forms ---\")\nfor key_value in document.key_values:\n    print(f\"{key_value.key}: {key_value.value}\")\n\n# Print extracted tables\nprint(\"\\n--- Tables ---\")\nfor i, table in enumerate(document.tables, start=1):\n    print(f\"Table {i}:\\n{table.to_csv()}\\n\")\n","lang":"python","description":"This quickstart demonstrates how to initialize the Textractor client, process a PDF document located in an S3 bucket using `start_document_analysis` to extract forms and tables, and then print the 
extracted data. Ensure your AWS credentials and default region are configured for boto3."},"warnings":[{"fix":"Review any code that processes Textractor's HTML output and relies on consistent IDs between the `TABLE` block and the `LAYOUT_TABLE` representation. Adjust your parsing logic to account for this ID divergence if necessary.","message":"As of v1.8.3, `LAYOUT_TABLE` elements generated for HTML output no longer share an ID with the original `TABLE` prediction. This breaks code that matched IDs between Textract's `TABLE` block and Textractor's HTML representation of the table layout.","severity":"breaking","affected_versions":">=1.8.3"},{"fix":"For local PDF processing, always install with `pip install amazon-textract-textractor[pdf]`. If you encounter errors, verify `pypdfium2` is correctly installed and working. If `pdf2image` is used, ensure Poppler is installed and correctly configured in your system's PATH.","message":"Processing local PDF files requires additional dependencies. `pip install amazon-textract-textractor[pdf]` installs `pypdfium2`, the recommended rasterizer; if it is unavailable, the library may fall back to `pdf2image`, which requires Poppler (e.g., `poppler-utils` on Linux, `brew install poppler` on macOS, or pre-compiled binaries on Windows) to be installed on the operating system. A missing or misconfigured Poppler installation is a common source of errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure your AWS credentials are correctly configured for `boto3` (e.g., `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION` environment variables, or `~/.aws/credentials` file). Always specify `region_name` in the `Textractor` constructor, matching the region where your S3 buckets are located and where Textract is available.","message":"Textractor relies on `boto3` for AWS authentication and region configuration. 
Common issues arise from misconfigured AWS credentials (e.g., missing environment variables, an incorrect `~/.aws/credentials` file, or an IAM role that is not attached or assumed) or from an incorrect `region_name` passed to `Textractor`. These typically surface as 'Access Denied' errors or as region and endpoint errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If very large documents time out, raise the polling or timeout settings that your installed version exposes for `start_document_analysis` and `start_document_text_detection`; check parameter names such as `polling_interval` or `timeout` against that version's signature before relying on them. Be aware of Textract service limits on document size and number of pages.","message":"For larger documents (e.g., multi-page PDFs), Textract operations are asynchronous. Textractor wraps them in the `start_document_analysis` and `start_document_text_detection` methods, but it still polls for job completion under the hood. Long-running jobs can time out or appear to hang, and Textract service limits on document size and page count still apply.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z","problems":[]}