Avro Record Class and Specific Record Reader Generator
avro-gen3 is a Python library that generates concrete, type-hinted Avro record classes together with a specific record reader. The default Avro Python implementation returns untyped dictionaries; avro-gen3 wraps the standard Avro `DatumReader` so that it returns the generated classes instead. The project is a fork of `avro_gen` with improved Python 3 support, better namespace handling, documentation generation, and JSON (de-)serialization. The current version is 0.7.16, released on September 5, 2024; releases are active but irregular.
Warnings
- gotcha avro-gen3 generates specific record classes as `DictWrapper` subclasses and does NOT provide an overloaded `DictWriter`. Generated records therefore offer type-hinted access but are serialized like regular Python dictionaries by the standard Avro `DatumWriter`.
- breaking `apache-avro` versions 1.10 and later moved `AvroTypeException` out of `avro.io` into a different package. Code that expects the old location, whether generated by `avro-gen3` or in its dependencies, can fail with `AttributeError: module 'avro.io' has no attribute 'AvroTypeException'`. This often manifests when custom properties contain non-string values.
- gotcha When defining optional fields in Avro schemas, the `type` must be a union with `"null"` as the *first* type, and a `default` value must be specified as the literal `null` (not the string `"null"`). Incorrectly formatted optional fields can lead to consumer-side exceptions even if the schema appears valid for encoding.
- gotcha Generated Avro classes are organized into submodules reflecting their Avro namespaces within the output directory. Importing them requires correctly constructing the Python import path based on the Avro namespace and the chosen output directory.
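The optional-field rule in the warnings above can be checked before any code is generated. The following is a minimal sketch using only the standard library. `find_bad_optional_fields` is a hypothetical helper name, not part of avro-gen3, and it only inspects the top-level fields of a single record schema.

```python
import json
from typing import List


def find_bad_optional_fields(schema_json: str) -> List[str]:
    """Return names of nullable union fields that break the null-first rule.

    Hypothetical helper, not part of avro-gen3. Nested records are ignored.
    """
    schema = json.loads(schema_json)
    bad = []
    for field in schema.get("fields", []):
        ftype = field.get("type")
        if not (isinstance(ftype, list) and "null" in ftype):
            continue  # not a nullable union, nothing to check
        null_first = ftype[0] == "null"
        # JSON null parses to None; the string "null" stays a str.
        has_null_default = "default" in field and field["default"] is None
        if not (null_first and has_null_default):
            bad.append(field["name"])
    return bad


EXAMPLE_SCHEMA = '''
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.app",
  "fields": [
    {"name": "ok", "type": ["null", "int"], "default": null},
    {"name": "wrong_order", "type": ["int", "null"], "default": null},
    {"name": "string_default", "type": ["null", "int"], "default": "null"},
    {"name": "required", "type": "string"}
  ]
}
'''

print(find_bad_optional_fields(EXAMPLE_SCHEMA))
# → ['wrong_order', 'string_default']
```

A check like this could run in a test suite so that malformed optional fields are caught before they reach a consumer.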
Install
pip install avro-gen3
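Because the `AvroTypeException` warning depends on which Avro release is installed, it may help to inspect the environment after installing. The sketch below makes several assumptions: the distribution names `avro-gen3` and `avro`, and `avro.errors` as a candidate newer location for the exception. `installed_version` and `locate_attr` are hypothetical helper names.

```python
import importlib
from importlib import metadata
from typing import Optional, Sequence


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def locate_attr(attr: str, module_names: Sequence[str]) -> Optional[str]:
    """Return the first module in module_names that exposes attr, else None."""
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # module not available in this environment
        if hasattr(module, attr):
            return module_name
    return None


# Distribution and module names below are assumptions; adjust as needed.
print("avro-gen3:", installed_version("avro-gen3"))
print("avro:", installed_version("avro"))
print("AvroTypeException found in:",
      locate_attr("AvroTypeException", ["avro.errors", "avro.io"]))
```

If the last line reports a module other than `avro.io`, code that still reads `avro.io.AvroTypeException` is a likely source of the `AttributeError` described in the warnings.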
Imports
- write_schema_files
from avrogen import write_schema_files
- GeneratedRecordClass
from <output_directory>.<avro_namespace_path> import <RecordName>
- SpecificDatumReader
from <output_directory> import SpecificDatumReader
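The placeholders in the import lines above combine the output directory with the Avro namespace. As a rough illustration, the mapping can be written out as string and path manipulation. This sketch assumes the output directory is importable as a package and that the namespace is non-empty. `import_line_for` and `module_dir_for` are hypothetical helpers, and `generated` is an arbitrary package name.

```python
from pathlib import Path


def import_line_for(output_package: str, avro_namespace: str, record_name: str) -> str:
    """Build the import statement for a record in a non-empty Avro namespace."""
    return f"from {output_package}.{avro_namespace} import {record_name}"


def module_dir_for(output_dir: Path, avro_namespace: str) -> Path:
    """Directory where the submodule for the namespace is expected to live."""
    return output_dir.joinpath(*avro_namespace.split("."))


print(import_line_for("generated", "com.example.app", "User"))
# → from generated.com.example.app import User
print(module_dir_for(Path("generated"), "com.example.app"))
```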
Quickstart
import sys
import tempfile
from pathlib import Path

from avrogen import write_schema_files

# 1. Define a simple Avro schema. The optional field lists "null" first so
#    that the null default matches the first branch of the union.
avro_schema_json = '''
{
    "type": "record",
    "name": "User",
    "namespace": "com.example.app",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "favorite_number", "type": ["null", "int"], "default": null}
    ]
}
'''

# 2. Define an output directory for generated classes. It is imported as a
#    package, so its parent directory goes on sys.path.
#    "generated" is an arbitrary package name chosen for this example.
with tempfile.TemporaryDirectory() as tmpdir_name:
    output_dir = Path(tmpdir_name) / "generated"
    output_dir.mkdir()
    print(f"Generated Avro classes will be written to: {output_dir}")

    # 3. Generate Python classes from the Avro schema
    write_schema_files(avro_schema_json, str(output_dir))

    # Add the parent of the output directory to sys.path to enable import
    sys.path.insert(0, tmpdir_name)
    try:
        # 4. Import the generated class. The Avro namespace 'com.example.app'
        #    becomes a submodule path beneath the output package.
        from generated.com.example.app import User
        from avro.io import DatumWriter, DatumReader
        from avro.datafile import DataFileWriter, DataFileReader

        # 5. Create an instance of the generated class
        user_record = User(name="Alice", favorite_number=123)
        print(f"Created user record: {user_record}")
        print(f"User name: {user_record.name}, Favorite number: {user_record.favorite_number}")

        # 6. Serialize and deserialize with the standard Avro writer and reader.
        #    Generated records wrap a dict, so the inner dict is what gets written.
        #    The record schema is read from the generated class (RECORD_SCHEMA).
        output_file = Path(tmpdir_name) / "users.avro"
        writer = DataFileWriter(open(output_file, "wb"), DatumWriter(), User.RECORD_SCHEMA)
        writer.append(user_record._inner_dict)
        writer.close()

        reader = DataFileReader(open(output_file, "rb"), DatumReader())
        for read_user_dict in reader:
            # The plain DatumReader returns dicts; re-wrap them if desired.
            read_user = User(**read_user_dict)
            print(f"Deserialized user: {read_user.name}, {read_user.favorite_number}")
        reader.close()
    finally:
        # Clean up sys.path
        sys.path.remove(tmpdir_name)
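The quickstart passes `_inner_dict` to the writer and re-wraps plain dictionaries on read. That works because a generated record is a thin typed layer over a dictionary. The pattern can be illustrated without the library. `UserSketch` below is a hand-written stand-in, not avro-gen3's generated code or its `DictWrapper`. Only the `_inner_dict` name is borrowed from the quickstart.

```python
from typing import Any, Dict, Optional


class UserSketch:
    """Illustrative stand-in for a generated record: typed access over a dict."""

    def __init__(self, name: str, favorite_number: Optional[int] = None) -> None:
        self._inner_dict: Dict[str, Any] = {}
        # The property setters below store the values in the inner dict.
        self.name = name
        self.favorite_number = favorite_number

    @property
    def name(self) -> str:
        return self._inner_dict["name"]

    @name.setter
    def name(self, value: str) -> None:
        self._inner_dict["name"] = value

    @property
    def favorite_number(self) -> Optional[int]:
        return self._inner_dict["favorite_number"]

    @favorite_number.setter
    def favorite_number(self, value: Optional[int]) -> None:
        self._inner_dict["favorite_number"] = value


record = UserSketch(name="Alice", favorite_number=123)
payload = record._inner_dict        # what a dict-based writer would receive
restored = UserSketch(**payload)    # re-wrapping a dict returned by a reader
print(payload)
print(restored.name, restored.favorite_number)
```

Keeping all state in one dictionary is what lets a dict-based writer serialize the record unchanged, while the properties give editors and type checkers something to work with.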