Fast Avro for Python
Fastavro is a high-performance Python library for reading and writing Avro files. It is a significantly faster alternative to the official Apache Avro Python library because its core is compiled with Cython. The library supports multiple compression codecs and is actively maintained, making it a popular choice for high-throughput Avro serialization and deserialization in Python applications.
Warnings
- breaking The global cache of parsed schemas was removed in version 0.24.0. Code that relied on manipulating or accessing this global cache via `parse_schema` would have broken.
- deprecated Using `python-snappy` for Snappy compression is deprecated. `fastavro` recommends `cramjam` for Snappy, Zstandard, and LZ4 compression due to better compatibility and features.
- gotcha Reading Avro data with an incompatible reader schema (e.g., missing fields, mismatched types) can lead to `SchemaResolutionError` or incorrect data during deserialization.
- gotcha When serializing data, if a field has no default in the Avro schema and is missing from the Python dictionary record, `fastavro` will raise an error. (Avro has no explicit 'required' marker; a field is effectively required when it lacks a default.)
- gotcha When appending records to an existing Avro file, the file must be opened in `a+b` mode (read and append binary). Passing `None` as the schema to the `writer` function is recommended, as the existing file's schema will be reused. Using `ab` mode or providing a schema will likely lead to errors.
- gotcha Using `parse_schema(..., expand=True)` generates a schema that may not fully conform to the Avro specification for all scenarios, especially when dealing with referenced schemas. The output of this function with `expand=True` should generally be considered for output/inspection only and not passed directly to `reader` or `writer` functions, as it might cause exceptions.
- gotcha When reading a union of records, if `return_record_name=True` is specified in `reader()`, the result for a union type will be a tuple `(record_name, record_value)`. If a union contains only one record type, `return_record_name_override=True` can modify this behavior to return just the record value, without the name tuple.
Install
-
pip install fastavro
Imports
- writer
from fastavro import writer
- reader
from fastavro import reader
- parse_schema
from fastavro import parse_schema
Quickstart
import io
from fastavro import writer, reader, parse_schema

# 1. Define an Avro schema
schema = {
    'doc': 'A simple user record.',
    'name': 'User',
    'namespace': 'example.avro',
    'type': 'record',
    'fields': [
        {'name': 'name', 'type': 'string'},
        # 'null' comes first in the union so that a default of None is valid
        {'name': 'favorite_number', 'type': ['null', 'int'], 'default': None},
        {'name': 'favorite_color', 'type': ['string', 'null'], 'default': 'green'},
    ],
}

# It's optional but recommended to parse the schema once for performance
parsed_schema = parse_schema(schema)

# 2. Prepare some records
records = [
    {'name': 'Alice', 'favorite_number': 256, 'favorite_color': 'blue'},
    {'name': 'Bob', 'favorite_number': 7, 'favorite_color': None},
    {'name': 'Charlie', 'favorite_number': None, 'favorite_color': 'red'},
]

# 3. Write records to an in-memory Avro file (BytesIO)
bytes_writer = io.BytesIO()
writer(bytes_writer, parsed_schema, records, codec='deflate')

# 4. Read records back from the in-memory Avro file
bytes_writer.seek(0)  # Rewind the buffer to the beginning
avro_reader = reader(bytes_writer)
read_records = [record for record in avro_reader]

print("Original Records:", records)
print("Read Records:", read_records)

# Verify that read records match original records
assert records == read_records
print("Successfully wrote and read Avro records!")