Amazon Textract Textractor
Textractor is a Python package designed to simplify the use of Amazon Textract for document analysis. It provides a higher-level abstraction over the AWS SDK (boto3) for extracting text, forms, tables, and other data from documents. The library is actively maintained, with frequent minor releases that fix bugs and introduce new features.
Warnings
- breaking As of v1.8.3, `LAYOUT_TABLE` elements generated for HTML output no longer share the same ID as the original `TABLE` prediction. Code that relied on matching IDs between Textract's TABLE block and Textractor's HTML representation of the table layout will no longer work.
- gotcha Processing local PDF files requires additional dependencies. `pypdfium2` is the recommended default for PDF support (installed via `pip install amazon-textract-textractor[pdf]`), but the library might fall back to `pdf2image`. `pdf2image`, in turn, requires Poppler to be installed on the operating system (e.g., `poppler-utils` on Linux, `brew install poppler` on macOS, or pre-compiled binaries on Windows), and a missing Poppler install is a common source of installation issues.
- gotcha Textractor relies on `boto3` for AWS authentication and region configuration. Common issues arise from misconfigured AWS credentials (e.g., missing environment variables, incorrect `~/.aws/credentials` file, or IAM role not properly attached/assumed) or specifying an incorrect `region_name` when initializing `Textractor`, leading to 'Access Denied' or 'Region Not Found' errors.
- gotcha For larger documents (e.g., multi-page PDFs), Textract operations are asynchronous. Textractor simplifies this with the `start_document_analysis` and `start_document_text_detection` methods, but under the hood it still polls for job completion. Long-running jobs can lead to timeouts or perceived hangs if not handled carefully, and Textract service limits also apply.
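The polling behaviour described in the last warning can be sketched in plain Python. This is a generic illustration of polling with a deadline and exponential backoff, not Textractor's internal implementation; `poll_until` and all of its parameters are hypothetical names introduced here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(
    fetch: Callable[[], Optional[T]],
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` until it returns a non-None result or the deadline passes.

    `fetch` is a hypothetical callable that returns None while the job is
    still running and the finished result once it is done.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        result = fetch()
        if result is not None:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"job did not finish within {timeout_s} seconds")
        # Never sleep past the deadline; double the delay up to the cap.
        sleep(min(delay, remaining))
        delay = min(delay * 2, max_delay_s)
```

With the raw boto3 client, `fetch` could wrap `get_document_analysis(JobId=...)` and return the response once `JobStatus` is no longer `IN_PROGRESS`. An explicit deadline turns a perceived hang into a `TimeoutError` that the caller can handle.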
Install
- pip install amazon-textract-textractor
- pip install amazon-textract-textractor[pdf]
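After installing, it can help to check which of the PDF dependencies named in the warnings are actually present. The sketch below uses only the standard library; `pdf_backend_report` is a hypothetical helper name, and it assumes that finding Poppler's `pdftoppm` executable on `PATH` is a reasonable proxy for a working Poppler install.

```python
import importlib.util
import shutil


def pdf_backend_report() -> dict:
    """Report which PDF-related dependencies are discoverable on this machine."""
    return {
        # Python packages: find_spec returns None when a module is not installed.
        "pypdfium2": importlib.util.find_spec("pypdfium2") is not None,
        "pdf2image": importlib.util.find_spec("pdf2image") is not None,
        # pdf2image shells out to Poppler; pdftoppm is one of its executables.
        "pdftoppm": shutil.which("pdftoppm") is not None,
    }


if __name__ == "__main__":
    for name, available in pdf_backend_report().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

If `pdf2image` is found but `pdftoppm` is missing, installing Poppler for the operating system is the likely fix.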
Imports
- Textractor
from textractor import Textractor
- Document
from textractor.data.document import Document
Quickstart
import os
from textractor import Textractor
from textractor.data.constants import TextractFeatures
# Ensure AWS credentials are configured (e.g., via AWS CLI, env vars, IAM roles)
# textractor uses boto3, which automatically picks up credentials.
# Example: os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_ACCESS_KEY'
# os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET_KEY'
# os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'
# Initialize Textractor client
# Make sure the region matches your S3 bucket and Textract service availability
tractor = Textractor(region_name=os.environ.get('AWS_DEFAULT_REGION', 'us-east-1'))
# Example: Process a document from an S3 path
s3_path = "s3://amazon-textract-public-content/samples/sample.pdf"
print(f"Processing document: {s3_path}")
document = tractor.start_document_analysis(
    file_source=s3_path,
    features=[TextractFeatures.FORMS, TextractFeatures.TABLES]
)
# Print extracted forms
print("\n--- Forms ---")
for key_value in document.key_values:
    print(f"{key_value.key}: {key_value.value}")
# Print extracted tables
print("\n--- Tables ---")
for i, table in enumerate(document.tables):
    print(f"Table {i+1}:\n{table.to_csv()}\n")
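For downstream processing, the extracted forms are often easier to handle as a dictionary than as printed lines. The sketch below is a hypothetical helper (`key_values_to_dict` is not part of Textractor) that relies only on the `.key_values`, `.key`, and `.value` attributes used in the quickstart. It assumes that `str()` of a key or value yields readable text, as the quickstart's f-string implies. The demo uses stand-in objects instead of a real Textract response.

```python
from types import SimpleNamespace


def key_values_to_dict(document) -> dict:
    """Group form values by key text; repeated keys keep every value."""
    result = {}
    for kv in document.key_values:
        key = str(kv.key).strip()
        value = str(kv.value).strip()
        result.setdefault(key, []).append(value)
    return result


# Stand-in objects for illustration only; a real run would pass the
# document returned by start_document_analysis.
fake_document = SimpleNamespace(
    key_values=[
        SimpleNamespace(key="Name:", value="Jane"),
        SimpleNamespace(key="Phone:", value=" 555-0100 "),
        SimpleNamespace(key="Name:", value="Doe"),
    ]
)
print(key_values_to_dict(fake_document))
```

Collecting values in lists avoids silently overwriting entries when a form repeats the same key, for example on multi-page documents.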