Amazon Textract Response Parser for Python
The `amazon-textract-response-parser` library for Python parses the JSON responses returned by Amazon Textract. It converts the raw JSON into Python objects, making it easier to work with the parts of a document, such as pages, lines, words, forms, and tables. The current version is 1.0.3. The library is maintained by AWS Samples, and releases are typically tied to updates in Amazon Textract's capabilities or improvements in parsing logic.
Warnings
- gotcha When processing multi-page Textract responses, especially those downloaded from S3 or split into multiple files, it's crucial to ensure they are loaded into the `TextractDocument` (or `TDocument`) constructor as an array of responses in the *correct page order*. If the order is incorrect (e.g., due to alphabetical file sorting like '1.json', '11.json', '2.json'), ID associations across pages may break, leading to parsing errors or incorrect document structure.
- gotcha Amazon Textract occasionally updates its JSON response schema, particularly for complex structures like tables (e.g., adding `MERGED_CELL` blocks or the `COLUMN_HEADER` entity type). `amazon-textract-response-parser` aims to abstract these changes, but if you are working with older Textract responses or a specific Textract model version, the parsed object structure may differ slightly from what the latest Textract output produces. The library usually incorporates updates to handle new Textract features, so check that your library version is compatible with the responses you are parsing.
- deprecated Earlier versions of Textract Response Parser for Python and JavaScript/TypeScript may have substantially different APIs and features. This warning primarily concerns migration between language implementations, but the Python API itself may also evolve, so migrate code that relies on older Python TRP APIs with care.
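The page-order warning above can be illustrated with a small sorting sketch. It assumes the response parts are saved as files whose names contain the part number, as in the '1.json', '11.json', '2.json' example. The helper names `page_order_key` and `sort_response_files` are hypothetical and not part of the library.

```python
import re
from pathlib import Path
from typing import Iterable, List, Tuple


def page_order_key(path: Path) -> Tuple[int, int, str]:
    """Hypothetical helper: sort key based on the first number in the file stem."""
    match = re.search(r"\d+", path.stem)
    if match is None:
        # Files without any number sort after the numbered ones, by name
        return (1, 0, path.name)
    return (0, int(match.group()), path.name)


def sort_response_files(paths: Iterable[str]) -> List[Path]:
    """Hypothetical helper: order response files numerically, not alphabetically."""
    return sorted((Path(p) for p in paths), key=page_order_key)


names = ["1.json", "11.json", "2.json"]
ordered = [p.name for p in sort_response_files(names)]
print(ordered)  # ['1.json', '2.json', '11.json']
```

A plain `sorted(names)` would keep '11.json' ahead of '2.json', which is the ordering problem described in the warning.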
Install
pip install amazon-textract-response-parser
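To check which version ended up installed, one option is the standard library's `importlib.metadata`. This is a minimal sketch: the helper name `installed_version` is hypothetical, and it only reports the installed distribution version without checking compatibility with any particular Textract response.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Hypothetical helper: return the installed version, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


trp_version = installed_version("amazon-textract-response-parser")
print(trp_version or "amazon-textract-response-parser is not installed")
```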
Imports
- TDocument
from trp.trp2 import TDocument
- TDocumentSchema
from trp.trp2 import TDocumentSchema
- TAnalyzeIdDocument
from trp.trp2_analyzeid import TAnalyzeIdDocument
- TAnalyzeIdDocumentSchema
from trp.trp2_analyzeid import TAnalyzeIdDocumentSchema
Quickstart
import json
from trp.trp2 import TDocument, TDocumentSchema
# Example Textract JSON response (simplified for demonstration)
# In a real scenario, this would come from an Amazon Textract API call
textract_json_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}},
            "Id": "0",
            "Relationships": [{
                "Type": "CHILD",
                "Ids": ["1", "2"]
            }]
        },
        {
            "BlockType": "LINE",
            "Confidence": 99.0,
            "Geometry": {"BoundingBox": {"Width": 0.5, "Height": 0.05, "Left": 0.1, "Top": 0.1}},
            "Id": "1",
            "Text": "Hello, Textract!",
            # A LINE references its WORD blocks through a CHILD relationship
            "Relationships": [{
                "Type": "CHILD",
                "Ids": ["2"]
            }]
        },
        {
            "BlockType": "WORD",
            "Confidence": 99.0,
            "Geometry": {"BoundingBox": {"Width": 0.2, "Height": 0.03, "Left": 0.1, "Top": 0.1}},
            "Id": "2",
            "Text": "Hello,",
            "Relationships": []
        }
    ]
}
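# Deserialize the Textract JSON into a TDocument object
t_doc: TDocument = TDocumentSchema().load(textract_json_response)

# TDocument holds a flat list of blocks, so lines and words are looked up
# through the document rather than as attributes of a page or line
for page_number, page in enumerate(t_doc.pages, start=1):
    print(f"Processing Page: {page_number}")
    for line in t_doc.lines(page):
        print(f"  Line: {line.text} (Confidence: {line.confidence:.2f})")
        # WORD blocks are referenced from the LINE's CHILD relationships
        for relationship in line.relationships or []:
            if relationship.type != "CHILD":
                continue
            for word_id in relationship.ids:
                word = t_doc.get_block_by_id(word_id)
                print(f"    Word: {word.text} (Confidence: {word.confidence:.2f})")

# Example of serializing the object back to JSON (optional)
# serialized_json = TDocumentSchema().dump(t_doc)
# print(json.dumps(serialized_json, indent=2))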
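For comparison, the relationship lookups that the parser handles can be sketched against the raw JSON using only the standard library. This is an illustration of the Textract block structure (blocks linked by `CHILD` relationships), not the library's implementation. The helpers `child_blocks` and `words_per_line` and the `sample` data are hypothetical.

```python
from typing import Dict, Iterator, List


def child_blocks(block: dict, blocks_by_id: Dict[str, dict]) -> Iterator[dict]:
    """Hypothetical helper: yield the blocks referenced by CHILD relationships."""
    for relationship in block.get("Relationships") or []:
        if relationship.get("Type") != "CHILD":
            continue
        for block_id in relationship.get("Ids", []):
            if block_id in blocks_by_id:
                yield blocks_by_id[block_id]


def words_per_line(response: dict) -> Dict[str, List[str]]:
    """Hypothetical helper: map each LINE's text to the text of its WORD children."""
    blocks = response.get("Blocks", [])
    blocks_by_id = {block["Id"]: block for block in blocks}
    result: Dict[str, List[str]] = {}
    for block in blocks:
        if block.get("BlockType") == "LINE":
            result[block["Text"]] = [
                child["Text"]
                for child in child_blocks(block, blocks_by_id)
                if child.get("BlockType") == "WORD"
            ]
    return result


# Made-up sample with one line and its two words
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "0",
         "Relationships": [{"Type": "CHILD", "Ids": ["1"]}]},
        {"BlockType": "LINE", "Id": "1", "Text": "Hello, Textract!",
         "Relationships": [{"Type": "CHILD", "Ids": ["2", "3"]}]},
        {"BlockType": "WORD", "Id": "2", "Text": "Hello,"},
        {"BlockType": "WORD", "Id": "3", "Text": "Textract!"},
    ]
}

line_words = words_per_line(sample)
print(line_words)  # {'Hello, Textract!': ['Hello,', 'Textract!']}
```

Building the ID map once keeps each lookup constant-time, which matters for responses with thousands of blocks.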