arro3-io

0.8.0 · active · verified Thu Apr 16

arro3-io is a Python library that provides streaming-capable readers and writers for Arrow-compatible data formats, including Parquet, Arrow IPC, JSON, and CSV. It is part of the `arro3` ecosystem, a minimal Python interface to Apache Arrow's Rust implementation that offers a more lightweight alternative to PyArrow. The library takes a streaming-first approach: readers return lazy iterators, so larger-than-memory datasets can be processed incrementally instead of being loaded all at once. It is actively maintained (current version 0.8.0) and interoperates with other Python data libraries that implement the Arrow PyCapsule Interface.

Quickstart

This quickstart writes and reads Arrow-compatible data with `arro3-io`. It builds a small table with pandas and PyArrow, writes it to an in-memory buffer with `arro3.io.write_parquet`, and reads it back with `arro3.io.read_parquet`. The resulting streaming `RecordBatchReader` is then materialized into an `arro3.core.Table` and converted back to PyArrow and pandas to show interoperability.

import arro3.io
import arro3.core
import pyarrow as pa
import pandas as pd
import io

# 1. Create some dummy data using pandas and pyarrow
df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["A", "B", "C"]})
pa_table = pa.Table.from_pandas(df)

# 2. Write the data to an in-memory buffer as a Parquet file using arro3.io
buffer = io.BytesIO()
arro3.io.write_parquet(pa_table, buffer)
buffer.seek(0)

# 3. Read the Parquet data back from the buffer using arro3.io
# arro3.io.read_parquet returns a RecordBatchReader (an iterator)
reader = arro3.io.read_parquet(buffer)

# 4. Materialize the streaming RecordBatchReader into an arro3.core.Table
arro3_table = arro3.core.Table(reader)

print("Original Pandas DataFrame:")
print(df)
print("\narro3 Table read back:")
print(arro3_table)

# 5. Demonstrate interoperability by converting the arro3.core.Table back to PyArrow and pandas.
# arro3.core.Table exports the Arrow PyCapsule Interface, so pa.table() can consume it directly.
pa_roundtrip = pa.table(arro3_table)
print("\narro3 Table converted to PyArrow Table:")
print(pa_roundtrip)
print("\narro3 Table converted to Pandas DataFrame:")
print(pa_roundtrip.to_pandas())
