Apache Avro Python
Avro is a data serialization and RPC framework for various languages, including Python. It uses JSON for defining data types and protocols and serializes data in a compact binary format. The Python library provides tools for schema parsing, binary encoding/decoding, and working with Avro Data Files. The current version is 1.12.1, with releases typically occurring a few times a year for minor or patch updates.
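To make "compact binary format" concrete: the Avro specification encodes `int` and `long` values as variable-length zig-zag integers, so values of small magnitude take a single byte. The sketch below re-implements that encoding in plain Python for illustration only. `encode_long` and `decode_long` are hypothetical helper names, not part of the `avro` package, and the sketch assumes inputs fit in a signed 64-bit range.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag + variable-length encode a signed 64-bit integer (illustrative helper)."""
    zz = (n << 1) ^ (n >> 63)  # zig-zag: interleave negatives with positives
    out = bytearray()
    while zz & ~0x7F:  # more than 7 bits left
        out.append((zz & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        zz >>= 7
    out.append(zz)  # final byte, continuation bit clear
    return bytes(out)


def decode_long(data: bytes) -> int:
    """Inverse of encode_long (illustrative helper)."""
    zz = 0
    shift = 0
    for byte in data:
        zz |= (byte & 0x7F) << shift
        if not byte & 0x80:  # last byte of this value
            break
        shift += 7
    return (zz >> 1) ^ -(zz & 1)  # undo zig-zag


for value in (0, -1, 1, -64, 64, 256):
    encoded = encode_long(value)
    print(value, encoded.hex(), decode_long(encoded))
```

With this scheme `-64` fits in one byte (`7f`) while `64` needs two (`80 01`), matching the examples in the Avro specification.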
Warnings
- deprecated The `avro-python3` PyPI package is deprecated. Install and use the `avro` package instead; current releases of `avro` target Python 3. The `avro-python3` package will be removed in the near future.
- gotcha Mixing Python 2 and Python 3 code or environments leads to `SyntaxError`. The `avro` package is intended for Python 3+ and does not install cleanly into older Python 2 environments, and Python 2-only syntax (e.g., `except Exception, e:`) does not parse under Python 3.
- gotcha The official Python Avro library is implemented in pure Python, which can lead to slow performance when processing large volumes of data or complex schemas. This is a common pain point for users.
- gotcha When reading Avro files, the reader's schema must be compatible with the writer's schema, adhering to Avro's schema evolution rules. Mismatched or missing fields (especially required ones) between reader and writer schemas can lead to errors or unexpected data during deserialization.
Install
- pip install avro
- pip install avro[snappy,zstandard]
Imports
- avro.schema.parse
import avro.schema
schema = avro.schema.parse(json_schema_string)
- DataFileReader
from avro.datafile import DataFileReader
- DataFileWriter
from avro.datafile import DataFileWriter
- DatumReader
from avro.io import DatumReader
- DatumWriter
from avro.io import DatumWriter
Quickstart
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
import io
# Define schema
schema_str = '''
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
'''
schema = avro.schema.parse(schema_str)
# Prepare data (Python uses None where the JSON schema says null)
users = [
    {"name": "Alyssa", "favorite_number": 256, "favorite_color": "red"},
    {"name": "Ben", "favorite_number": 7, "favorite_color": "blue"},
    {"name": "Charlie", "favorite_number": None, "favorite_color": "green"},
    {"name": "David", "favorite_number": 42, "favorite_color": None}
]
# Write data to an in-memory Avro file
# Using io.BytesIO for an in-memory file-like object
output_stream = io.BytesIO()
writer = DataFileWriter(output_stream, DatumWriter(), schema)
for user in users:
    writer.append(user)
# Flush buffered records and copy the bytes out before closing:
# writer.close() also closes the underlying stream
writer.flush()
avro_bytes = output_stream.getvalue()
writer.close()
# Read data back from a fresh in-memory stream positioned at the start
input_stream = io.BytesIO(avro_bytes)
reader = DataFileReader(input_stream, DatumReader())
print("Reading Avro data:")
for user in reader:
    print(user)
reader.close()