arro3-io
arro3-io is a Python library that provides streaming-capable readers and writers for various Apache Arrow-compatible data formats, including Parquet, Arrow IPC, JSON, and CSV. It is an integral part of the `arro3` ecosystem, which aims to be a minimal Python interface to Apache Arrow's Rust implementation, offering a more lightweight alternative to PyArrow. The library emphasizes a streaming-first approach, enabling efficient processing of larger-than-memory datasets through lazy iterators. It is actively maintained, with the current version being 0.8.0, and integrates seamlessly with other Python data libraries that implement the Arrow PyCapsule Interface.
Common errors
- ModuleNotFoundError: No module named 'arrow3.io'
  cause: The package name is `arro3-io` (with two 'r's) and its modules are under the `arro3` namespace, not `arrow3`.
  fix: Change your import statement from `import arrow3.io` to `import arro3.io`, and ensure you have installed the correct package: `pip install arro3-io`.
- AttributeError: module 'arro3.io' has no attribute 'Table'
  cause: The `Table` class, a fundamental Arrow data structure, is provided by the `arro3-core` package, not `arro3-io`.
  fix: Import `Table` from `arro3.core`: `from arro3.core import Table`. Install `arro3-core` if you haven't already: `pip install arro3-core`.
- TypeError: 'RecordBatchReader' object is not subscriptable
  cause: A `RecordBatchReader` is an iterator; it cannot be indexed like a list or array before its contents are materialized.
  fix: Materialize the `RecordBatchReader` into a `Table` or iterate over it: `table = arro3.core.Table(reader)` or `for batch in reader: ...`.
Warnings
- breaking In version 0.8.0, the serialization of a bare `DataType` through `__arrow_c_schema__` (e.g., when passing to `pyarrow.field`) now explicitly sets `nullable: true` to match PyArrow's equality semantics.
- gotcha The `arro3` project is distributed as modular namespace packages (`arro3-core`, `arro3-io`, `arro3-compute`). While `arro3-io` handles I/O, core Arrow data structures like `Table` or `RecordBatch` are provided by `arro3-core`. Users often need to install and import from `arro3-core` for full functionality.
- gotcha `arro3.io`'s read functions (e.g., `read_parquet`) return a `RecordBatchReader`, which is a lazy iterator. If you need to work with the entire dataset in memory, you must explicitly materialize it.
Install
- pip install arro3-io
- pip install arro3-core arro3-io arro3-compute
Imports
- read_parquet
from arro3.io import read_parquet
- write_parquet
from arro3.io import write_parquet
- read_ipc
from arro3.io import read_ipc
- write_ipc
from arro3.io import write_ipc
- Table (from arro3-core, not arro3-io)
from arro3.core import Table
Quickstart
import arro3.io
import arro3.core
import pyarrow as pa
import pandas as pd
import io
# 1. Create some dummy data using pandas and pyarrow
df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["A", "B", "C"]})
pa_table = pa.Table.from_pandas(df)
# 2. Write the data to an in-memory buffer as a Parquet file using arro3.io
buffer = io.BytesIO()
arro3.io.write_parquet(pa_table, buffer)
buffer.seek(0)
# 3. Read the Parquet data back from the buffer using arro3.io
# arro3.io.read_parquet returns a RecordBatchReader (an iterator)
reader = arro3.io.read_parquet(buffer)
# 4. Materialize the streaming RecordBatchReader into an arro3.core.Table
arro3_table = arro3.core.Table(reader)
print("Original Pandas DataFrame:")
print(df)
print("\narro3 Table read back:")
print(arro3_table)
# 5. Demonstrate interoperability: any PyCapsule-aware library can consume the arro3 Table
print("\narro3 Table converted back to a PyArrow Table:")
pa_roundtrip = pa.table(arro3_table)
print(pa_roundtrip)
print("\narro3 Table converted to a Pandas DataFrame:")
print(pa_roundtrip.to_pandas())