Amazon Textract Textractor

1.9.2 · active · verified Tue Apr 14

Textractor (published on PyPI as `amazon-textract-textractor`) is a Python package that simplifies calling Amazon Textract for document analysis. It provides a higher-level abstraction over the AWS SDK (boto3) for extracting text, forms, tables, and other data from documents. The library is actively maintained, with frequent minor releases that fix bugs and add features.

Warnings

Install

Imports

Quickstart

This quickstart initializes the Textractor client, submits a PDF stored in an S3 bucket to `start_document_analysis` (the asynchronous Textract API, which handles multi-page documents) to extract forms and tables, and prints the extracted data. Make sure your AWS credentials and default region are configured for boto3 before running it.

import os
from textractor import Textractor
from textractor.data.constants import TextractFeatures

# Ensure AWS credentials are configured (e.g., via AWS CLI, env vars, IAM roles)
# textractor uses boto3, which automatically picks up credentials.
# Example: os.environ['AWS_ACCESS_KEY_ID'] = 'YOUR_ACCESS_KEY'
# os.environ['AWS_SECRET_ACCESS_KEY'] = 'YOUR_SECRET_KEY'
# os.environ['AWS_DEFAULT_REGION'] = 'us-east-1'

# Initialize Textractor client
# Make sure the region matches your S3 bucket and Textract service availability
tractor = Textractor(region_name=os.environ.get('AWS_DEFAULT_REGION', 'us-east-1'))

# Example: Process a document from an S3 path
s3_path = "s3://amazon-textract-public-content/samples/sample.pdf"
print(f"Processing document: {s3_path}")

document = tractor.start_document_analysis(
    file_source=s3_path,
    features=[TextractFeatures.FORMS, TextractFeatures.TABLES]
)

# Print extracted forms
print("\n--- Forms ---")
for key_value in document.key_values:
    print(f"{key_value.key}: {key_value.value}")

# Print extracted tables as CSV text
print("\n--- Tables ---")
for i, table in enumerate(document.tables):
    print(f"Table {i+1}:\n{table.to_csv()}\n")
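
Because the quickstart passes an `s3://` path straight to `start_document_analysis`, it can help to check the path locally before submitting a job. The following is a minimal sketch using only the standard library; `split_s3_uri` is a hypothetical helper written for illustration and is not part of Textractor or boto3.

```python
from typing import Tuple


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key).

    Hypothetical helper for illustration; not part of Textractor.
    Raises ValueError when the scheme, bucket, or key is missing.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected an s3:// URI, got: {uri!r}")
    # Everything before the first "/" is the bucket, the rest is the object key.
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI needs both a bucket and an object key: {uri!r}")
    return bucket, key


# Same sample path as the quickstart above.
bucket, key = split_s3_uri("s3://amazon-textract-public-content/samples/sample.pdf")
print(f"bucket={bucket}")
print(f"key={key}")
```

Running a check like this first surfaces a mistyped path as a local `ValueError`, before any request reaches AWS.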
