Adjust Precision for Schema

0.3.4 verified Mon Apr 13 auth: no python

This library (version 0.3.4) is designed for use in Singer.io data integration targets. It reconciles numeric precision differences between source systems, Python's native numeric types, and target data warehouses or databases, with the aim of keeping decimal and floating-point values consistent and accurate during the ETL process. Based on available PyPI data, releases appear to be irregular.

pip install adjust-precision-for-schema
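The name passed to pip is the distribution name and contains hyphens, while a Python import statement can only use a module name with underscores. The following is a minimal sketch, using only the standard library, for checking both names in the active environment. The helper `check_install` is hypothetical and not part of this library, and the module name `adjust_precision_for_schema` is assumed from the import statements shown on this page.

```python
import importlib.metadata
import importlib.util


def check_install(dist_name: str, module_name: str) -> dict:
    """Report whether a distribution is installed and its module is importable."""
    try:
        # Looks up installed package metadata by distribution (pip) name.
        version = importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        version = None
    # Looks up the importable module by its Python name, without importing it.
    spec = importlib.util.find_spec(module_name)
    return {
        "distribution": dist_name,
        "version": version,
        "module": module_name,
        "importable": spec is not None,
    }


if __name__ == "__main__":
    # Distribution name uses hyphens; the assumed module name uses underscores.
    print(check_install("adjust-precision-for-schema", "adjust_precision_for_schema"))
```

If `version` is `None`, the distribution is not installed in the interpreter that ran the check, which is the usual reason for a `ModuleNotFoundError`.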
error ModuleNotFoundError: No module named 'adjust_precision_for_schema'
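cause The package is not installed in the active Python environment, or the code runs under a different interpreter than the one pip installed into. The underscore module name itself is valid; hyphens cannot appear in a Python import statement.
fix
Install the package in the same environment with 'pip install adjust-precision-for-schema', then use 'import adjust_precision_for_schema'.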
error ImportError: cannot import name 'adjust_precision_for_schema' from 'adjust-precision-for-schema'
cause The hyphenated name is the PyPI distribution name, not the importable module name. Python module names use underscores.
fix
Use the correct import statement: 'import adjust_precision_for_schema'.
error ValueError: Decimal type with precision 7 does not fit into precision inferred from first array element: 8
cause The decimal precision in the data exceeds the defined precision in the schema.
fix
Round or quantize values so that their precision does not exceed the precision defined in the schema, or widen the precision defined in the schema.
error pyarrow.lib.ArrowInvalid: Decimal type with precision 7 does not fit into precision inferred from first array element: 8
cause The decimal precision in the data exceeds the defined precision in the schema.
fix
Round or quantize values so that their precision does not exceed the precision defined in the schema, or widen the precision defined in the schema.
error org.apache.avro.AvroTypeException: Cannot encode decimal with scale 10 as scale 11
cause The scale of the decimal value does not match the expected scale in the schema.
fix
Adjust the scale of the decimal value to match the expected scale in the schema.
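The decimal errors above all come down to a value whose precision (total digits) or scale (fractional digits) does not fit what the schema declares. The following is a minimal sketch of checking and fitting a value using only Python's standard `decimal` module. The helper names `precision_and_scale` and `fit_to_scale` are hypothetical and are not part of this library, and the choice of ROUND_HALF_EVEN is an assumption.

```python
from decimal import Decimal, ROUND_HALF_EVEN
from typing import Tuple


def precision_and_scale(value: Decimal) -> Tuple[int, int]:
    """Return (precision, scale) of a finite Decimal: total digits, fractional digits."""
    _, digits, exponent = value.as_tuple()
    scale = max(-exponent, 0)
    # Digits left of the decimal point; zero for values such as 0.001.
    integer_digits = max(len(digits) + exponent, 0)
    return integer_digits + scale, scale


def fit_to_scale(value: Decimal, precision: int, scale: int) -> Decimal:
    """Round to `scale` fractional digits, then fail if `precision` is exceeded."""
    quantum = Decimal(1).scaleb(-scale)  # scale=2 gives Decimal("0.01")
    rounded = value.quantize(quantum, rounding=ROUND_HALF_EVEN)
    actual_precision, _ = precision_and_scale(rounded)
    if actual_precision > precision:
        raise ValueError(
            f"{rounded} needs precision {actual_precision}, schema allows {precision}"
        )
    return rounded


if __name__ == "__main__":
    print(fit_to_scale(Decimal("123.456789"), precision=7, scale=2))
    try:
        fit_to_scale(Decimal("123456.789"), precision=7, scale=2)
    except ValueError as exc:
        print("rejected:", exc)
```

Rounding fixes a scale mismatch, but a value with too many integer digits cannot be rounded into range, so the sketch raises instead of truncating silently.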
gotcha Without an explicit `multipleOf` or `precision`/`scale` definition in your Singer.io JSON Schema, the library may not be able to correctly infer the desired precision for numeric fields. Ensure your schemas are as explicit as possible for critical numeric types.
fix Explicitly define `multipleOf` (e.g., `0.01` for two decimal places) for `number` types in your JSON Schema that represent decimals, or leverage Singer-specific extensions for `precision` and `scale` if the target supports them.
gotcha Floating-point inaccuracies in Python can lead to unexpected rounding behavior. While this library aims to mitigate this, always test the precision adjustments with edge cases (e.g., `X.Y4999` vs `X.Y5000`) to ensure desired rounding.
fix Use Python's `decimal` module for internal representation and calculations when absolute precision is critical, and confirm that the rounding the library applies matches the target system's rounding rules (e.g., HALF_UP, HALF_EVEN).
gotcha Schema evolution and changes in source data precision can silently break downstream data pipelines if not properly managed. Relying solely on automatic precision adjustment without validation can mask underlying data quality issues.
fix Implement robust schema validation and data quality checks in your Singer.io pipeline. Monitor data for unexpected precision changes at both the source and after adjustment. Consider versioning your schemas and communicating changes to consumers.
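To make the first two gotchas concrete, here is a small sketch of deriving a scale from `multipleOf` and of how the rounding mode changes a tie. It uses only the standard `decimal` module. The helpers `scale_from_multiple_of` and `round_to_scale` are hypothetical names chosen for illustration and do not describe how this library works internally.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def scale_from_multiple_of(multiple_of) -> int:
    """Derive the number of fractional digits implied by a JSON Schema multipleOf."""
    # Go through str() so a float such as 0.01 becomes Decimal("0.01"),
    # not the long binary expansion that Decimal(0.01) would produce.
    step = Decimal(str(multiple_of)).normalize()
    return max(-step.as_tuple().exponent, 0)


def round_to_scale(value, scale: int, rounding=ROUND_HALF_EVEN) -> Decimal:
    """Round a number (or numeric string) to `scale` fractional digits."""
    quantum = Decimal(1).scaleb(-scale)
    return Decimal(str(value)).quantize(quantum, rounding=rounding)


if __name__ == "__main__":
    scale = scale_from_multiple_of(0.01)
    print("scale:", scale)
    # The same tie value rounds differently under the two modes.
    print("HALF_UP:  ", round_to_scale("0.125", scale, ROUND_HALF_UP))
    print("HALF_EVEN:", round_to_scale("0.125", scale, ROUND_HALF_EVEN))
```

With a scale of 2, the tie value 0.125 becomes 0.13 under ROUND_HALF_UP and 0.12 under ROUND_HALF_EVEN, so the mode should be chosen to match the target system.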

This example demonstrates how the `adjust_precision` function (hypothesized based on the library's purpose) might be used within a Singer.io data pipeline. It takes a data record and a JSON Schema, adjusting numeric values within the record to conform to the precision and scale implied by the schema, particularly for fields marked with `"_singer_type": "decimal"` and `"multipleOf"`.

import json
from adjust_precision_for_schema import adjust_precision

# Example Singer SCHEMA message (simplified)
# This schema defines a 'price' field with a logical 'decimal' type
# and an implied precision/scale (e.g., up to 2 decimal places).
schema_message = {
    "type": "SCHEMA",
    "stream": "products",
    "schema": {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "name": {"type": "string"},
            "price": {
                "type": ["number", "null"],
                # Keys below had doubled opening quotes, which is a syntax error.
                "_singer_type": "decimal",
                "maximum": 1000000000000000000000000000000000000.00,
                "multipleOf": 0.01
            }
        }
    },
    "key_properties": ["id"]
}

# Example Singer RECORD message
record_message = {
    "type": "RECORD",
    "stream": "products",
    "record": {
        "id": 1,
        "name": "Product A",
        "price": 123.456789  # Value with more precision than schema intends
    }
}

# Another record with a value that should be adjusted minimally
record_message_2 = {
    "type": "RECORD",
    "stream": "products",
    "record": {
        "id": 2,
        "name": "Product B",
        "price": 99.99999999999999 # Value that should round up
    }
}

# Hypothetical function call to adjust precision based on the schema
# The exact API (e.g., arguments, return type) is inferred.
adjusted_record_1 = adjust_precision(record_message['record'], schema_message['schema'])
adjusted_record_2 = adjust_precision(record_message_2['record'], schema_message['schema'])

print("Original Record 1 Price:", record_message['record']['price'])
print("Adjusted Record 1 Price:", adjusted_record_1['price'])

print("Original Record 2 Price:", record_message_2['record']['price'])
print("Adjusted Record 2 Price:", adjusted_record_2['price'])
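Because the `adjust_precision` name used above is hypothesized, the behavior the example describes can also be tried with a self-contained stand-in. The sketch below uses only the standard library. The function `adjust_record_sketch` is a hypothetical illustration and is not this library's implementation or API, and ROUND_HALF_EVEN is an assumed rounding mode.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def adjust_record_sketch(record: dict, schema: dict) -> dict:
    """Hypothetical stand-in: round numeric fields to the scale implied by multipleOf."""
    adjusted = dict(record)
    for name, prop in schema.get("properties", {}).items():
        value = record.get(name)
        multiple_of = prop.get("multipleOf")
        # Skip fields without multipleOf, null values, and booleans.
        if multiple_of is None or value is None or isinstance(value, bool):
            continue
        if not isinstance(value, (int, float, Decimal)):
            continue
        # quantize() only uses the exponent of the quantum, so 0.01 means
        # "two fractional digits", not "a multiple of 0.01" in general.
        quantum = Decimal(str(multiple_of))
        adjusted[name] = Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_EVEN)
    return adjusted


if __name__ == "__main__":
    schema = {
        "type": "object",
        "properties": {
            "id": {"type": "integer"},
            "price": {"type": ["number", "null"], "multipleOf": 0.01},
        },
    }
    print(adjust_record_sketch({"id": 1, "price": 123.456789}, schema)["price"])
    print(adjust_record_sketch({"id": 2, "price": 99.99999999999999}, schema)["price"])
```

The sketch returns `Decimal` values instead of floats so that the rounded result is not distorted again by binary floating-point representation.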