Amazon Textract Parser (textract-trp)
0.1.3 | verified Fri May 01 | auth: no | python
A parser for Amazon Textract results that converts the raw JSON response into a structured document model with pages, lines, words, tables, and forms. Version 0.1.3 (latest as of verification) supports Python >=3.6. The library provides high-level abstractions for navigating Textract output, including bounding boxes, confidence scores, and relationships between elements. It is maintained on GitHub by mludvig.
pip install textract-trp
Common errors
error ModuleNotFoundError: No module named 'textract'
cause The import targets the wrong module. 'textract' is a different OCR library and is not provided by textract-trp.
fix Run 'pip install textract-trp' and import as 'from textract_trp import TextractParser'.
error AttributeError: 'dict' object has no attribute 'pages'
cause The raw response dictionary was used as if it were a parsed document, without first passing it through TextractParser.parse().
fix Use parser = TextractParser(); document = parser.parse(response), where response is the raw boto3 client response.
error KeyError: 'Blocks' in JSON response
cause The input is not a valid Textract response. The JSON may be malformed, or it may be an error response.
fix Check that the response has a 'Blocks' key. Call the Textract API correctly and make sure the response contains no errors.
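The 'Blocks' check above can be done before parsing, so that a bad input fails with a readable message. This is a minimal sketch that uses plain dictionaries only. The helper name ensure_textract_response is hypothetical and is not part of textract-trp.

```python
def ensure_textract_response(response):
    """Return the response unchanged if it has a 'Blocks' key, else raise ValueError.

    Hypothetical helper for illustration; not part of textract-trp.
    """
    if not isinstance(response, dict):
        raise ValueError(f"expected a dict, got {type(response).__name__}")
    if "Blocks" not in response:
        # List the keys that are present to make the failure easier to diagnose.
        raise ValueError(
            f"response has no 'Blocks' key; keys present: {sorted(response)}"
        )
    return response
```

Calling it on the boto3 result right before parsing turns a bare KeyError into a message that lists the keys the response actually contained.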
Warnings
gotcha The library does not handle pagination of multi-page Textract responses. You must pass 'NextToken' to Textract yourself and parse each response separately.
fix Loop over responses, passing the NextToken from the previous response, until NextToken is missing.
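The NextToken loop described in this fix might look like the sketch below. The fetch_page callable is a hypothetical stand-in for whatever call retrieves one response, which keeps the loop independent of boto3. The commented wiring assumes an asynchronous job read with get_document_analysis, and assumes job_id and client already exist.

```python
def iter_paginated_responses(fetch_page):
    """Yield each raw response, following NextToken until it is absent.

    fetch_page is a hypothetical callable: it receives the token from the
    previous response (None on the first call) and returns one response dict.
    """
    token = None
    while True:
        response = fetch_page(token)
        yield response
        token = response.get("NextToken")
        if not token:
            break


# Possible wiring with boto3 (assumes job_id refers to a finished async job):
# def fetch_page(token):
#     kwargs = {"JobId": job_id}
#     if token:
#         kwargs["NextToken"] = token
#     return client.get_document_analysis(**kwargs)
```

Each yielded dictionary can then be parsed separately, as the warning requires.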
deprecated The package name on PyPI is 'textract-trp'. Some older documentation references 'textract', which is a different library (for OCR).
fix Always use 'pip install textract-trp'. Do not confuse with 'textract' (general OCR) or 'amazon-textract-textractor'.
gotcha TextractParser.parse() expects the raw response dictionary from boto3, not a JSON string. Passing a string will cause JSONDecodeError or attribute errors.
fix Ensure you pass the response object directly from the boto3 client call, e.g., result = client.analyze_document(...); document = parser.parse(result).
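When a response was saved to disk and read back as text, it has to be turned back into a dictionary before parsing. A minimal sketch, assuming the saved text is the JSON form of the response. The helper name as_response_dict is hypothetical and is not part of textract-trp.

```python
import json


def as_response_dict(payload):
    """Return a response dictionary, decoding JSON text or bytes if needed.

    Hypothetical helper for illustration; not part of textract-trp.
    """
    if isinstance(payload, (str, bytes, bytearray)):
        # json.loads accepts str, bytes, and bytearray input.
        payload = json.loads(payload)
    if not isinstance(payload, dict):
        raise TypeError(
            f"expected a response dict, got {type(payload).__name__}"
        )
    return payload
```

A dictionary passes through untouched, so the helper is safe to call on a fresh boto3 result as well.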
breaking In version 0.1.3, the TRP class attributes have changed from previous releases. 'page.lines' and 'page.words' are now objects, not lists of strings. Accessing .text on these objects is correct.
fix Use the .text property on line and word objects. For a plain list of strings, use a list comprehension: [line.text for line in page.lines].
Imports
- TextractParser
wrong: from textract.trp import TextractParser
correct: from textract_trp import TextractParser
- TRP
from textract_trp import TRP
Quickstart
import os

import boto3
from textract_trp import TextractParser

# Initialize Textract client (credentials are read from environment variables)
client = boto3.client(
    'textract',
    region_name='us-east-1',
    aws_access_key_id=os.environ.get('AWS_ACCESS_KEY_ID', ''),
    aws_secret_access_key=os.environ.get('AWS_SECRET_ACCESS_KEY', ''),
)

# Analyze a document from S3
response = client.analyze_document(
    Document={'S3Object': {'Bucket': 'my-bucket', 'Name': 'document.pdf'}},
    FeatureTypes=['TABLES', 'FORMS'],
)

# Parse the response
parser = TextractParser()
document = parser.parse(response)

# Iterate pages and lines
for page in document.pages:
    for line in page.lines:
        print(line.text)

# Access tables, printing each row as a list of cell texts
for page in document.pages:
    for table in page.tables:
        for row in table.rows:
            print([cell.text for cell in row.cells])