ArrayRecord
ArrayRecord is a high-performance file format derived from Riegeli, designed for machine learning workloads. It supports parallel reads, parallel writes, and random access by record index. The current version is 0.8.3, and releases are published regularly.
Warnings
- breaking The `__getitem__` method signature changed in `v0.4.0`. It now strictly accepts a single integer index and returns a single record, aligning with Python's standard `__getitem__` behavior. Batching multiple indexes is no longer supported directly via `__getitem__`.
- breaking Code that relies on `__getitem__` to implicitly handle a list of indices breaks as of `v0.4.0`. The change was staged: in `v0.3.0`, batching was no longer *required* for good performance but was still handled implicitly, and `v0.4.0` then made single-item access strict.
- gotcha When using `array_record_data_source` for multi-file or random access, the ArrayRecord files must have been created with the option string `'group_size:1'` passed to `ArrayRecordWriter`. Otherwise the `DataSource` may not function as expected or may be inefficient.
- gotcha ArrayRecord requires Python 3.11 or newer for current versions. Older versions (e.g., 0.2.0) supported Python 3.8+.
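For code that previously passed a list of indices to `__getitem__`, one migration path is to index one record at a time. The sketch below assumes only that the source supports `len()` and integer indexing. `fetch_batch` is a hypothetical helper, not part of the library, and a plain list of bytes stands in for the data source so the example runs without any ArrayRecord files.

```python
from typing import Sequence


def fetch_batch(source, indices: Sequence[int]) -> list:
    """Fetch several records by calling source[i] once per index.

    `source` is anything that supports len() and integer indexing.
    This helper is illustrative and not part of ArrayRecord.
    """
    n = len(source)
    records = []
    for i in indices:
        if not 0 <= i < n:
            raise IndexError(f"record index {i} out of range for {n} records")
        records.append(source[i])
    return records


# A list of bytes stands in for a data source here (assumption for the demo).
stand_in = [f"Record {i} data".encode("utf-8") for i in range(10)]
batch = fetch_batch(stand_in, [0, 3, 7])
print([r.decode("utf-8") for r in batch])
```

Wrapping the loop in a helper keeps call sites close to their old batched form while respecting the single-index contract.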
Install
- pip install array-record
- pip install array-record[beam]
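Because current versions require Python 3.11 or newer, a setup script can check the interpreter before attempting the install. This is a minimal sketch: `meets_python_requirement` is a hypothetical helper name, and the `(3, 11)` floor is taken from the Warnings section.

```python
import sys


def meets_python_requirement(version_info=sys.version_info, minimum=(3, 11)) -> bool:
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if not meets_python_requirement():
    # Only reached on interpreters older than the assumed 3.11 floor.
    print(
        "Current array-record versions need Python 3.11+; found "
        f"{sys.version_info.major}.{sys.version_info.minor}"
    )
```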
Imports
- array_record_module
from array_record.python import array_record_module
- array_record_data_source
from array_record.python import array_record_data_source
Quickstart
from array_record.python import array_record_module
import os
# Define output path
output_file = 'output.array_record'
# --- Writing Records ---
# Use `group_size:1` for optimized random access; larger sizes improve sequential/batch access and compression.
writer = array_record_module.ArrayRecordWriter(output_file, 'group_size:1')
for i in range(10):
    data = f"Record {i} data".encode('utf-8')
    writer.write(data)
writer.close()
print(f"Wrote 10 records to {output_file}")
# --- Reading Records (File-level API) ---
reader = array_record_module.ArrayRecordReader(output_file)
print(f"Reading records from {output_file}:")
# num_records is a method; read() with no arguments returns the next record in order.
for i in range(reader.num_records()):
    record = reader.read()
    print(f" Record {i}: {record.decode('utf-8')}")
reader.close()
# --- Reading Records (Multi-file API with DataSource) ---
# Note: for DataSource, the writer must be created with the 'group_size:1' option string
from array_record.python import array_record_data_source
# In a real scenario, you'd have multiple files, e.g., ['file1.array_record', 'file2.array_record']
data_source = array_record_data_source.ArrayRecordDataSource([output_file])
print(f"Reading records using DataSource from {output_file}:")
for i in range(len(data_source)):
    record = data_source[i]
    print(f" DataSource Record {i}: {record.decode('utf-8')}")
# Clean up the created file
os.remove(output_file)
print(f"Cleaned up {output_file}")
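When several files are passed to a multi-file source, a global record index has to be resolved to a file and a position within that file. The sketch below shows one common way to do this with cumulative record counts and binary search. It illustrates the general technique only and is not a description of how `ArrayRecordDataSource` is implemented; `locate` and the example record counts are hypothetical.

```python
import bisect
from itertools import accumulate


def locate(global_index: int, records_per_file: list[int]) -> tuple[int, int]:
    """Map a global record index to (file_number, index_within_file)."""
    ends = list(accumulate(records_per_file))  # cumulative end offsets per file
    if not ends or not 0 <= global_index < ends[-1]:
        raise IndexError(f"record index {global_index} out of range")
    # First file whose cumulative end is strictly greater than the index.
    file_number = bisect.bisect_right(ends, global_index)
    start = ends[file_number] - records_per_file[file_number]
    return file_number, global_index - start


# Hypothetical record counts for three files.
counts = [10, 5, 20]
print(locate(0, counts))
print(locate(12, counts))
print(locate(34, counts))
```

Precomputing the cumulative offsets once, rather than on every call as done here for brevity, would keep each lookup at a single binary search.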