Amazon Textract Response Parser for Python
raw JSON → 1.0.3 verified Thu May 14 auth: no python
The `amazon-textract-response-parser` library for Python simplifies the process of parsing JSON responses returned by Amazon Textract. It converts the raw JSON into programming language-specific constructs, making it easier to work with different parts of a document, such as pages, lines, words, forms, and tables. The current version is 1.0.3, and it is actively maintained by AWS Samples, with releases typically tied to updates in Amazon Textract's capabilities or improvements in parsing logic.
pip install amazon-textract-response-parser
Common errors
error ModuleNotFoundError: No module named 'trp'
cause The `amazon-textract-response-parser` distribution is not installed in the active Python environment. The package is installed under that name but imported as `trp`, so the error message names `trp`.
fix Install the library using pip: 'pip install amazon-textract-response-parser'.
error AttributeError: module 'trp' has no attribute 'Document'
cause The name `trp` resolved to something other than this library's package, typically a local file or directory named `trp` that shadows it, or an incomplete installation.
fix Ensure the library is installed, rename any local module called `trp`, and import the class directly: 'from trp import Document'.
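When an import of `trp` fails or yields a module without the expected classes, it can help to check where Python resolves the name from. The following is a minimal sketch using only the standard library; the helper name `locate_module` is hypothetical and not part of the parser library. A path inside your own project rather than `site-packages` would suggest that a local file is shadowing the installed package.

```python
import importlib.util
from typing import Optional


def locate_module(name: str) -> Optional[str]:
    """Return the location Python would import `name` from, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised for some broken or partially initialised modules.
        return None
    if spec is None:
        return None
    return spec.origin


# The distribution is called 'amazon-textract-response-parser',
# but the importable top-level module is 'trp'.
print(locate_module("trp"))
```

A result of `None` means the module is not importable from the current environment at all.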
error TypeError: 'NoneType' object is not iterable
cause The Textract response does not contain the expected data, possibly due to an empty or invalid response.
fix Verify that the Textract response is valid and contains the expected data before parsing.
error ValueError: Invalid JSON response from Textract
cause The response from Amazon Textract is not a valid JSON object, possibly due to an error in the Textract operation.
fix Check the Textract operation for errors and ensure the response is a valid JSON object before parsing.
error KeyError: 'Blocks'
cause The 'Blocks' key is missing in the Textract response, indicating an incomplete or malformed response.
fix Ensure that the Textract operation completes successfully and returns a response containing the 'Blocks' key.
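Several of the errors above come down to parsing a response that is empty, unfinished, or missing 'Blocks'. A small pre-check can surface that earlier with a clearer message. This is a sketch under assumptions: `check_textract_response` is a hypothetical helper, not part of the library, and it only looks at the 'JobStatus' field (present on asynchronous job results) and the 'Blocks' list.

```python
from typing import Any, Dict, List


def check_textract_response(response: Any) -> List[Dict[str, Any]]:
    """Return the list of blocks, or raise ValueError describing what is wrong."""
    if not isinstance(response, dict):
        raise ValueError(f"expected a dict, got {type(response).__name__}")

    # Asynchronous job results carry 'JobStatus'; synchronous calls omit it.
    status = response.get("JobStatus")
    if status is not None and status not in ("SUCCEEDED", "PARTIAL_SUCCESS"):
        raise ValueError(f"Textract job did not finish successfully: {status}")

    blocks = response.get("Blocks")
    if not isinstance(blocks, list) or not blocks:
        raise ValueError("response has no non-empty 'Blocks' list")
    return blocks


try:
    check_textract_response({"JobStatus": "IN_PROGRESS"})
except ValueError as err:
    print(f"Not ready to parse: {err}")
```

Running the check before handing the response to the parser turns a `KeyError` or `TypeError` deep inside the parsing code into one explicit error at the call site.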
Warnings
gotcha When processing multi-page Textract responses, especially those downloaded from S3 or split into multiple files, it's crucial to ensure they are loaded into the `TextractDocument` (or `TDocument`) constructor as an array of responses in the *correct page order*. If the order is incorrect (e.g., due to alphabetical file sorting like '1.json', '11.json', '2.json'), ID associations across pages may break, leading to parsing errors or incorrect document structure.
fix Manually sort the list of Textract JSON responses by page number before passing them to the `TDocumentSchema().load()` method or the `TDocument` constructor when dealing with multi-page documents.
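The ordering problem can be avoided by sorting on the numeric part of each file name instead of the raw string. The sketch below assumes the files are named after their page number (as in the '1.json', '11.json', '2.json' example); `sort_by_page_number` is a hypothetical helper, and other naming schemes would need a different key.

```python
import re
from pathlib import Path
from typing import Iterable, List


def sort_by_page_number(paths: Iterable[str]) -> List[str]:
    """Sort file names by the first integer in the base name (numeric, not alphabetical)."""

    def page_key(path: str) -> int:
        match = re.search(r"\d+", Path(path).stem)
        if match is None:
            raise ValueError(f"no page number in file name: {path}")
        return int(match.group())

    return sorted(paths, key=page_key)


print(sort_by_page_number(["1.json", "11.json", "2.json"]))
# prints ['1.json', '2.json', '11.json']
```

Each file can then be read with `json.load` in that order before the responses are handed to the parser.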
gotcha The Amazon Textract service occasionally extends its JSON response schema, particularly for complex structures like tables (e.g., adding the `MERGED_CELL` block type or the `COLUMN_HEADER` entity type). `amazon-textract-response-parser` aims to abstract these changes, but responses produced by an older Textract model version may parse into a slightly different object structure than current Textract output. The library usually adds support for new Textract features, so check that your installed version is compatible with the responses you are parsing.
fix Regularly update `amazon-textract-response-parser` to the latest version to ensure compatibility with the most recent Textract service responses. Consult the library's GitHub issues or releases for notes on specific Textract service schema changes.
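To see which version is actually installed before comparing against the release notes, the standard library's `importlib.metadata` can be queried. This is a minimal sketch; `installed_version` is a hypothetical helper name, and it returns `None` when the distribution is absent rather than raising.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "amazon-textract-response-parser") -> Optional[str]:
    """Return the installed version string of `distribution`, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


print(installed_version())
```

Note that the lookup uses the distribution name from pip, not the import name `trp`.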
deprecated Earlier versions of the Textract Response Parser, in both Python and JavaScript/TypeScript, may have substantially different APIs and feature sets. The warning is aimed mainly at moving code between the language implementations, but the Python API has also changed over time, so code written against older Python TRP APIs should be migrated to newer versions with care.
fix Refer to the specific version's `README.md` and release notes on GitHub for API compatibility details when upgrading across significant minor or major versions. The `trp.trp2` module is the current standard for Python.
Install compatibility last tested: 2026-05-14 v1.0.3 (up to date)
python  os / libc      wheel  install  import  disk   mem   side effects
3.10    alpine (musl)  wheel  -        0.22s   52.0M  8.0M  clean
3.10    alpine (musl)  -      -        0.25s   51.9M  8.0M  -
3.10    slim (glibc)   wheel  4.2s     0.17s   53M    8.0M  clean
3.10    slim (glibc)   -      -        0.15s   52M    8.0M  -
3.11    alpine (musl)  wheel  -        0.33s   55.3M  8.9M  clean
3.11    alpine (musl)  -      -        0.39s   55.2M  8.9M  -
3.11    slim (glibc)   wheel  4.0s     0.30s   56M    8.9M  clean
3.11    slim (glibc)   -      -        0.29s   56M    8.9M  -
3.12    alpine (musl)  wheel  -        0.28s   46.8M  8.8M  clean
3.12    alpine (musl)  -      -        0.34s   46.7M  8.8M  -
3.12    slim (glibc)   wheel  3.4s     0.29s   47M    8.8M  clean
3.12    slim (glibc)   -      -        0.33s   47M    8.8M  -
3.13    alpine (musl)  wheel  -        0.26s   46.6M  8.8M  clean
3.13    alpine (musl)  -      -        0.28s   46.4M  8.8M  -
3.13    slim (glibc)   wheel  3.1s     0.26s   47M    8.8M  clean
3.13    slim (glibc)   -      -        0.30s   47M    8.8M  -
3.9     alpine (musl)  wheel  -        0.17s   51.4M  7.2M  clean
3.9     alpine (musl)  -      -        0.20s   51.4M  7.2M  -
3.9     slim (glibc)   wheel  4.6s     0.15s   52M    7.2M  clean
3.9     slim (glibc)   -      -        0.20s   52M    7.2M  -
Imports
- TDocument
from trp.trp2 import TDocument
- TDocumentSchema
from trp.trp2 import TDocumentSchema
- TAnalyzeIdDocument
from trp.trp2_analyzeid import TAnalyzeIdDocument
- TAnalyzeIdDocumentSchema
from trp.trp2_analyzeid import TAnalyzeIdDocumentSchema
Quickstart last tested: 2026-04-25
import json
from trp.trp2 import TDocument, TDocumentSchema

# Example Textract JSON response (simplified for demonstration).
# In a real scenario, this would come from an Amazon Textract API call.
textract_json_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0, "Left": 0.0, "Top": 0.0}},
            "Id": "0",
            "Page": 1,
            # The page's children are its lines
            "Relationships": [{"Type": "CHILD", "Ids": ["1"]}]
        },
        {
            "BlockType": "LINE",
            "Confidence": 99.0,
            "Geometry": {"BoundingBox": {"Width": 0.5, "Height": 0.05, "Left": 0.1, "Top": 0.1}},
            "Id": "1",
            "Text": "Hello, Textract!",
            # The line's children are its words
            "Relationships": [{"Type": "CHILD", "Ids": ["2"]}]
        },
        {
            "BlockType": "WORD",
            "Confidence": 99.0,
            "Geometry": {"BoundingBox": {"Width": 0.2, "Height": 0.03, "Left": 0.1, "Top": 0.1}},
            "Id": "2",
            "Text": "Hello,",
            "Relationships": []
        }
    ]
}

# Deserialize Textract JSON into a TDocument object
t_doc: TDocument = TDocumentSchema().load(textract_json_response)

# In trp.trp2, pages, lines, and words are all TBlock objects, and navigation
# goes through the TDocument rather than through attributes on each block.
# (Method names follow the trp2 module; verify them against your installed version.)
for page in t_doc.pages:
    print(f"Processing Page: {page.page}")
    for line in t_doc.lines(page=page):
        print(f"  Line: {line.text} (Confidence: {line.confidence:.2f})")
        # Words are the CHILD relationships of a LINE block
        child_rel = line.get_relationships_for_type("CHILD")
        for word_id in (child_rel.ids if child_rel else []):
            word = t_doc.get_block_by_id(word_id)
            print(f"    Word: {word.text} (Confidence: {word.confidence:.2f})")

# Example of serializing the object back to JSON (optional)
# serialized_json = TDocumentSchema().dump(t_doc)
# print(json.dumps(serialized_json, indent=2))