{"id":2861,"library":"amazon-textract-response-parser","title":"Amazon Textract Response Parser for Python","description":"The `amazon-textract-response-parser` library for Python simplifies the process of parsing JSON responses returned by Amazon Textract. It converts the raw JSON into programming language-specific constructs, making it easier to work with different parts of a document, such as pages, lines, words, forms, and tables. The current version is 1.0.3, and it is actively maintained by AWS Samples, with releases typically tied to updates in Amazon Textract's capabilities or improvements in parsing logic.","status":"active","version":"1.0.3","language":"en","source_language":"en","source_url":"https://github.com/aws-samples/amazon-textract-response-parser","tags":["aws","amazon","textract","ocr","document analysis","parser","json"],"install":[{"cmd":"pip install amazon-textract-response-parser","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Used for serialization/deserialization of Textract JSON responses into Python objects.","package":"marshmallow","optional":false}],"imports":[{"symbol":"TDocument","correct":"from trp.trp2 import TDocument"},{"symbol":"TDocumentSchema","correct":"from trp.trp2 import TDocumentSchema"},{"symbol":"TAnalyzeIdDocument","correct":"from trp.trp2_analyzeid import TAnalyzeIdDocument"},{"symbol":"TAnalyzeIdDocumentSchema","correct":"from trp.trp2_analyzeid import TAnalyzeIdDocumentSchema"}],"quickstart":{"code":"import json\nfrom trp.trp2 import TDocument, TDocumentSchema\n\n# Example Textract JSON response (simplified for demonstration)\n# In a real scenario, this would come from an Amazon Textract API call\ntextract_json_response = {\n    \"DocumentMetadata\": {\"Pages\": 1},\n    \"Blocks\": [\n        {\n            \"BlockType\": \"PAGE\",\n            \"Geometry\": {\"BoundingBox\": {\"Width\": 1.0, \"Height\": 1.0, \"Left\": 0.0, \"Top\": 0.0}},\n            \"Id\": \"0\",\n            
\"Relationships\": [{\n                \"Type\": \"CHILD\",\n                \"Ids\": [\"1\", \"2\"]\n            }]\n        },\n        {\n            \"BlockType\": \"LINE\",\n            \"Confidence\": 99.0,\n            \"Geometry\": {\"BoundingBox\": {\"Width\": 0.5, \"Height\": 0.05, \"Left\": 0.1, \"Top\": 0.1}},\n            \"Id\": \"1\",\n            \"Text\": \"Hello, Textract!\",\n            \"Relationships\": []\n        },\n        {\n            \"BlockType\": \"WORD\",\n            \"Confidence\": 99.0,\n            \"Geometry\": {\"BoundingBox\": {\"Width\": 0.2, \"Height\": 0.03, \"Left\": 0.1, \"Top\": 0.1}},\n            \"Id\": \"2\",\n            \"Text\": \"Hello,\",\n            \"Relationships\": []\n        }\n    ]\n}\n\n# Deserialize Textract JSON into a TDocument object\nt_doc: TDocument = TDocumentSchema().load(textract_json_response)\n\n# Accessing document elements\nfor page in t_doc.pages:\n    print(f\"Processing Page: {page.page_number}\")\n    for line in page.lines:\n        print(f\"  Line: {line.text} (Confidence: {line.confidence:.2f})\")\n        for word in line.words:\n            print(f\"    Word: {word.text} (Confidence: {word.confidence:.2f})\")\n\n# Example of serializing the object back to JSON (optional)\n# serialized_json = TDocumentSchema().dump(t_doc)\n# print(json.dumps(serialized_json, indent=2))","lang":"python","description":"This quickstart demonstrates how to load a raw Amazon Textract JSON response into a `TDocument` object and then iterate through its structured elements like pages, lines, and words. 
It also briefly shows how to serialize the `TDocument` object back into JSON."},"warnings":[{"fix":"Sort the Textract JSON responses numerically by page number (not alphabetically by file name) before combining them or passing them to the parser when dealing with multi-page documents.","message":"When processing multi-page Textract responses, especially those downloaded from S3 or split into multiple files, pass them to the parser (for example the `trp.Document` constructor in Python, or `TextractDocument` in JavaScript/TypeScript) as an array of responses in the *correct page order*. If the order is wrong (e.g., alphabetical file sorting yields '1.json', '11.json', '2.json'), ID associations across pages may break, leading to parsing errors or an incorrect document structure.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Regularly update `amazon-textract-response-parser` to the latest version to ensure compatibility with the most recent Textract service responses. Consult the library's GitHub issues or releases for notes on specific Textract service schema changes.","message":"The Amazon Textract service itself occasionally updates its JSON response schema, particularly for complex structures like tables (e.g., adding `MERGED_CELL` blocks or the `COLUMN_HEADER` entity type). While `amazon-textract-response-parser` aims to abstract these, if you are working with older Textract responses or a specific Textract model version, you might encounter slight differences in the parsed object structure compared to the latest Textract output. 
This library usually incorporates updates to handle new Textract features, but ensure your library version is compatible with the Textract service response you are parsing.","severity":"gotcha","affected_versions":"Potentially all versions, depending on Textract service updates."},{"fix":"Refer to the specific version's `README.md` and release notes on GitHub for API compatibility details when upgrading across major or significant minor versions. The `trp.trp2` module is the current standard for Python.","message":"Earlier versions of Textract Response Parser, in both Python and JavaScript/TypeScript, may have substantially different APIs and features. This caution applies mainly to migrating between language implementations, but the Python API itself may also evolve, so migrate code that relies on older Python TRP APIs with care.","severity":"deprecated","affected_versions":"Pre-1.0.0 (and potentially minor API changes in 1.x.x)"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}