BigQuery Schema Generator
bigquery-schema-generator is a Python library that generates BigQuery schemas from newline-delimited JSON or CSV data. Unlike BigQuery's native auto-detection, which typically samples only the first 500 records, this tool processes all input data to produce a more complete and accurate schema. The current version is 1.6.1, and the library continues to receive updates and bug fixes.
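To illustrate why scanning every record matters, here is a toy sketch. It is not the library's algorithm: `infer_fields` and the sample size of 2 are assumptions made for illustration. It collects field names and Python value types from a prefix of the records and from all of them; a field that first appears late in the data is missed by the prefix scan.

```python
import json


def infer_fields(lines, limit=None):
    """Map each field name to the set of Python type names seen for it.

    Hypothetical helper for illustration only. It inspects at most
    `limit` records, or all of them when `limit` is None.
    """
    seen = {}
    for index, line in enumerate(lines):
        if limit is not None and index >= limit:
            break
        record = json.loads(line)
        for key, value in record.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return seen


# Newline-delimited JSON records; "score" only shows up in the last one.
ndjson = [
    '{"id": 1, "name": "Alice"}',
    '{"id": 2, "name": "Bob"}',
    '{"id": 3, "name": "Carol", "score": 9.5}',
]

sampled = infer_fields(ndjson, limit=2)  # prefix scan, like a sampling detector
full = infer_fields(ndjson)              # full scan over every record

print(sorted(sampled))
print(sorted(full))
```

The same reasoning applies to type conflicts: a column that looks like an integer in the sampled prefix may contain strings further down, which only a full scan can notice.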
Warnings
- gotcha Prior to version 1.6.1, repeated type mismatches for a single field in the input data could cause the schema generator to 'forget' the field's type, leading to multiple warnings and potentially unstable schema deduction. Use version 1.6.1 or newer for robust type inference with inconsistent data.
- gotcha As of version 1.6.0, `null` fields can convert to `REPEATED`, to align with how `bq load` interprets null values for array-like fields (for example, as an empty list `[]`). Previously, `null` fields would typically be omitted or result in `NULLABLE`. Check any schema generation logic that relied on the older interpretation of nulls in potentially repeated fields.
- gotcha When using `SchemaGenerator` with existing BigQuery tables, keep BigQuery's schema evolution rules in mind: you cannot add `REQUIRED` columns to an existing table, so new columns must be `NULLABLE` or `REPEATED`. The library generates a schema, but applying one that introduces new `REQUIRED` fields to an existing table results in an error. Note that the `--infer_mode` flag, when used with CSV input, infers `REQUIRED` for fields whose values are all non-null.
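The schema evolution rule in the last warning can be handled with a small post-processing step. The sketch below is an assumption-laden illustration, not part of the library: `merge_for_existing_table` is a hypothetical helper that takes two flattened schemas (lists of dicts with `name`, `type`, and `mode` keys), keeps the existing columns unchanged, and appends new columns with `REQUIRED` demoted to `NULLABLE`. Nested `RECORD` fields are not inspected.

```python
def merge_for_existing_table(existing, generated):
    """Keep existing columns and append new ones, never as REQUIRED.

    Hypothetical helper for illustration; top-level columns only.
    """
    # Compare names case-insensitively, since BigQuery treats
    # column names that differ only by case as duplicates.
    existing_names = {column["name"].lower() for column in existing}
    merged = [dict(column) for column in existing]
    for column in generated:
        if column["name"].lower() in existing_names:
            continue  # keep the definition already on the table
        new_column = dict(column)
        if new_column.get("mode") == "REQUIRED":
            # A REQUIRED column cannot be added to an existing table.
            new_column["mode"] = "NULLABLE"
        merged.append(new_column)
    return merged


# Hand-written example schemas (not captured from a real table).
existing_schema = [
    {"mode": "REQUIRED", "name": "id", "type": "STRING"},
]
generated_schema = [
    {"mode": "REQUIRED", "name": "id", "type": "STRING"},
    {"mode": "REQUIRED", "name": "email", "type": "STRING"},
    {"mode": "REPEATED", "name": "tags", "type": "STRING"},
]

merged_schema = merge_for_existing_table(existing_schema, generated_schema)
for column in merged_schema:
    print(column["name"], column["type"], column["mode"])
```

Existing columns keep their original mode, so an `id` column that was created as `REQUIRED` stays that way; only columns that are new to the table are adjusted.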
Install
pip install bigquery-schema-generator
Imports
- SchemaGenerator
from bigquery_schema_generator.generate_schema import SchemaGenerator
Quickstart
import json

from bigquery_schema_generator.generate_schema import SchemaGenerator

# Example data as a list of dictionaries
data = [
    {
        "id": "rec1",
        "name": "Alice",
        "values": [10, 20]
    },
    {
        "id": "rec2",
        "name": "Bob",
        "values": [30]
    },
    {
        "id": "rec3",
        "name": None,  # Will be NULLABLE
        "values": []   # Will be REPEATED (empty array)
    }
]

# Initialize the schema generator; input_format='dict' makes it accept
# Python dictionaries instead of lines of JSON or CSV text
generator = SchemaGenerator(input_format='dict')

# Deduce the schema from the list of dictionaries; deduce_schema() returns
# the internal schema map together with a list of error logs
schema_map, error_logs = generator.deduce_schema(data)
schema = generator.flatten_schema(schema_map)

# Report any problems found while scanning the records
for error in error_logs:
    print(error)

# Print the generated BigQuery schema in JSON format
print(json.dumps(schema, indent=2))
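The flattened schema is a list of dictionaries in BigQuery's JSON schema file format, with `name`, `type`, and `mode` keys and a nested `fields` list for `RECORD` columns. As a minimal sketch of consuming that structure, the hypothetical helper `describe_fields` below walks such a list and produces one summary line per field. The `sample_schema` literal is hand-written for illustration and is not captured output from the library.

```python
def describe_fields(schema, prefix=""):
    """Return 'path: TYPE (MODE)' lines for a BigQuery JSON schema list.

    Hypothetical helper for illustration; nested RECORD fields are
    reported with dotted paths such as 'owner.email'.
    """
    lines = []
    for field in schema:
        path = prefix + field["name"]
        mode = field.get("mode", "NULLABLE")  # BigQuery's default mode
        lines.append(f"{path}: {field['type']} ({mode})")
        if field["type"] == "RECORD":
            lines.extend(describe_fields(field.get("fields", []), path + "."))
    return lines


# Hand-written schema in BigQuery's JSON schema file format.
sample_schema = [
    {"mode": "NULLABLE", "name": "id", "type": "STRING"},
    {"mode": "REPEATED", "name": "values", "type": "INTEGER"},
    {
        "mode": "NULLABLE",
        "name": "owner",
        "type": "RECORD",
        "fields": [
            {"mode": "NULLABLE", "name": "email", "type": "STRING"},
        ],
    },
]

summary = describe_fields(sample_schema)
for line in summary:
    print(line)
```

A summary like this is convenient for reviewing a generated schema before saving it to a file and passing it to `bq load`.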