Amazon Textract Response Parser for Python

1.0.3 · active · verified Sat Apr 11

The `amazon-textract-response-parser` library for Python parses the JSON responses returned by Amazon Textract. It converts the raw JSON into Python objects, which makes it easier to work with the parts of a document: pages, lines, words, forms, and tables. The current version is 1.0.3. It is actively maintained under AWS Samples, with releases typically tied to updates in Amazon Textract's capabilities or improvements in the parsing logic.
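
To make concrete what "raw JSON" means here, the following is a minimal, standard-library-only sketch of walking a Textract-style response by hand. It is not the library's implementation. The response fragment is hand-written for illustration, and the `children` helper is a hypothetical name. It only shows the `Blocks` / `Relationships` bookkeeping that a parser saves you from writing yourself.

```python
from typing import Any, Dict, List

# Hypothetical, hand-written response fragment (not real Textract output).
response: Dict[str, Any] = {
    "Blocks": [
        {"Id": "p1", "BlockType": "PAGE",
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"Id": "l1", "BlockType": "LINE", "Text": "Hello world",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "w1", "BlockType": "WORD", "Text": "Hello"},
        {"Id": "w2", "BlockType": "WORD", "Text": "world"},
    ]
}


def children(block: Dict[str, Any], index: Dict[str, Dict[str, Any]],
             block_type: str) -> List[Dict[str, Any]]:
    """Return the CHILD blocks of `block` that have the given BlockType."""
    found = []
    for rel in block.get("Relationships") or []:
        if rel.get("Type") != "CHILD":
            continue
        for child_id in rel.get("Ids", []):
            child = index.get(child_id)
            if child is not None and child.get("BlockType") == block_type:
                found.append(child)
    return found


# Index blocks by Id once so each relationship lookup is a dict access.
index = {b["Id"]: b for b in response["Blocks"]}

# Collect (line text, [word texts]) pairs for every page.
lines_with_words = []
for page in (b for b in response["Blocks"] if b["BlockType"] == "PAGE"):
    for line in children(page, index, "LINE"):
        words = [w["Text"] for w in children(line, index, "WORD")]
        lines_with_words.append((line["Text"], words))

print(lines_with_words)
```

Blocks reference each other only by `Id`, so any traversal needs an index plus relationship filtering like this. The parser's document objects wrap that lookup behind attributes and methods.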

Quickstart

This quickstart demonstrates how to load a raw Amazon Textract JSON response into a `TDocument` object and then iterate through its structured elements like pages, lines, and words. It also briefly shows how to serialize the `TDocument` object back into JSON.

import json
from trp.trp2 import TDocument, TDocumentSchema

# Example Textract JSON response (simplified for demonstration)
# In a real scenario, this would come from an Amazon Textract API call
textract_json_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}},
            "Id": "0",
            "Relationships": [{
                "Type": "CHILD",
                "Ids": ["1", "2"]
            }]
        },
        {
            "BlockType": "LINE",
            "Confidence": 99.0,
            "Geometry": {"BoundingBox": {"Width": 0.5, "Height": 0.05, "Left": 0.1, "Top": 0.1}},
            "Id": "1",
            "Text": "Hello, Textract!",
            "Relationships": [{"Type": "CHILD", "Ids": ["2"]}]
        },
        {
            "BlockType": "WORD",
            "Confidence": 99.0,
            "Geometry": {"BoundingBox": {"Width": 0.2, "Height": 0.03, "Left": 0.1, "Top": 0.1}},
            "Id": "2",
            "Text": "Hello,",
            "Relationships": []
        }
    ]
}

# Deserialize Textract JSON into a TDocument object
t_doc: TDocument = TDocumentSchema().load(textract_json_response)

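# Accessing document elements.
# In trp2, pages, lines, and words are all TBlock objects; the TDocument
# resolves the relationships between them.
for page_number, page in enumerate(t_doc.pages, start=1):
    print(f"Processing Page: {page_number}")
    for line in t_doc.lines(page=page):
        print(f"  Line: {line.text} (Confidence: {line.confidence:.2f})")
        # Words are the CHILD blocks referenced by the line's relationships
        for relationship in line.relationships or []:
            if relationship.type == "CHILD":
                for block_id in relationship.ids:
                    word = t_doc.get_block_by_id(block_id)
                    print(f"    Word: {word.text} (Confidence: {word.confidence:.2f})")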

# Example of serializing the object back to JSON (optional)
# serialized_json = TDocumentSchema().dump(t_doc)
# print(json.dumps(serialized_json, indent=2))
