Pandavro
Pandavro provides a convenient interface to read and write Avro files using pandas DataFrames. It simplifies the serialization and deserialization of tabular data between Python's pandas library and the Avro data format. The current version is 1.9.0, and it maintains an active release schedule with updates for Python, pandas, and NumPy compatibility.
Common errors
- fastavro.validation.ValidationError: The datum ... is not of type ...
  cause Data in a DataFrame column does not conform to the Avro schema inferred by `pandavro` or to an expected schema. This often happens due to mixed types in a column or unexpected `NaN`/`None` values.
  fix Ensure DataFrame columns have consistent dtypes. Handle `NaN`/`None` values explicitly (e.g., `df.fillna(value)` or converting to nullable pandas dtypes like `pd.Int64Dtype()`), or pre-process `object` columns so that all values share a consistent, Avro-compatible type.
- AttributeError: 'dict' object has no attribute 'items'
  cause Attempting to pass a Python dictionary, a pandas Series, or another non-DataFrame object directly to `pandavro.to_avro()`, which expects a `pandas.DataFrame` as its second argument.
  fix Ensure the second argument to `pandavro.to_avro()` is always a `pandas.DataFrame` object. Convert dictionaries or Series to DataFrames first (e.g., `pd.DataFrame(your_dict)` or `your_series.to_frame()`).
- ValueError: Cannot convert object of type <class 'some_complex_type'> to Avro type
  cause A column in the DataFrame contains complex Python objects (e.g., custom classes, unflattened nested lists/dicts, or non-standard types) that `pandavro` cannot automatically map to a standard Avro type.
  fix Pre-process the DataFrame to flatten complex structures or convert custom objects into basic, Avro-compatible types such as strings (e.g., by serializing them to JSON strings), integers, or floats before calling `pandavro.to_avro()`.
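The fixes above can be sketched as a pre-processing step that runs before `pandavro.to_avro()`. This is a minimal illustration using only pandas and the standard library; the column names, the sample data, and the `prepare_for_avro` helper are hypothetical and not part of pandavro.

```python
import json

import pandas as pd


def prepare_for_avro(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df whose columns hold consistent, simple types.

    Hypothetical helper for illustration; not part of pandavro.
    """
    out = df.copy()

    # pandas stores integers with missing values as float64; a nullable
    # dtype restores integer values and marks the gap as <NA>.
    out["count"] = out["count"].astype(pd.Int64Dtype())

    # Mixed-type object column: settle on one representation (strings here).
    out["code"] = out["code"].astype(str)

    # Nested dicts/lists: serialize to JSON so each cell is a plain string.
    out["meta"] = out["meta"].apply(json.dumps)
    return out


# A plain dict must become a DataFrame before it is written.
raw = {
    "count": [1, None, 3],
    "code": ["A1", 42, "C3"],
    "meta": [{"tags": ["x"]}, {"tags": []}, {"tags": ["y", "z"]}],
}
df = prepare_for_avro(pd.DataFrame(raw))
print(df.dtypes)
```

The resulting `df` is what would then be passed as the second argument to `pandavro.to_avro()`. Converting a mixed column to strings loses the original types, so whether that trade-off is acceptable depends on how the Avro file is consumed.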
Warnings
- breaking Version 1.9.0 introduces official support for pandas 2.0 and NumPy 2.0. While `pandavro` itself is adapted, upgrading these underlying libraries in your environment might introduce breaking changes in your own code, especially regarding pandas' copy-on-write behavior or NumPy's API changes.
- gotcha `pandavro` infers Avro schemas from pandas DataFrames. This inference might not perfectly align with pre-existing Avro schemas or desired Avro types, particularly for mixed-type columns, generic `object` dtypes, or specific handling of `NaN`/`None` values, leading to unexpected schemas or data type conversions.
- gotcha Handling of `NaN` (Not a Number) and `None` values can lead to subtle issues. `pandavro` typically maps `NaN` in numeric columns to `null` within an Avro union type (e.g., `["null", "double"]`). However, `None` in `object` columns might result in `string` or `bytes` types depending on other data, potentially causing schema mismatches.
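Because inference is driven by column dtypes, it can help to look for mixed-type `object` columns before writing. The check below is a rough sketch that uses only pandas; `mixed_type_columns` and the sample columns are hypothetical names chosen for illustration, not part of pandavro.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Map each object column to the set of Python type names it holds.

    Only columns with more than one distinct type (ignoring missing
    values) are reported. Hypothetical helper, not part of pandavro.
    """
    report = {}
    for column in df.select_dtypes(include="object").columns:
        type_names = {type(v).__name__ for v in df[column].dropna()}
        if len(type_names) > 1:
            report[column] = type_names
    return report


df = pd.DataFrame({
    "id": [1, 2, 3],
    "label": ["a", 2, None],   # str and int mixed with a missing value
    "note": ["x", None, "z"],  # consistent strings plus a missing value
})
print(mixed_type_columns(df))
```

Columns reported here are the ones most likely to produce a surprising inferred type. If inference still does not yield the desired types after cleaning, `pandavro.to_avro()` also accepts an explicit `schema` argument that bypasses inference; check the signature in your installed version.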
Install
- pip install pandavro
Imports
- to_avro
import pandavro as pa
pa.to_avro(...)
- read_avro
import pandavro as pa
pa.read_avro(...)
Quickstart
import pandas as pd
import pandavro as pa
import io
# 1. Create a pandas DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'value': [10.1, 20.2, 30.3]
})
print("Original DataFrame:")
print(df)
# 2. Write DataFrame to an Avro file (using BytesIO for in-memory example)
output_buffer = io.BytesIO()
pa.to_avro(output_buffer, df)  # the Avro schema is inferred from the DataFrame
# 3. Read Avro data back into a DataFrame
output_buffer.seek(0) # Reset buffer position for reading
read_df = pa.read_avro(output_buffer)
print("\nRead DataFrame from Avro:")
print(read_df)
# You can also use file paths directly:
# pa.to_avro('output.avro', df)
# loaded_df = pa.read_avro('output.avro')