ArrayRecord

0.8.3 · active · verified Fri Apr 10

ArrayRecord is a high-performance file format derived from Riegeli and designed for machine learning workloads. It targets I/O efficiency by supporting parallel reads, parallel writes, and random access by record index. The current version is 0.8.3, and releases appear to follow a regular cadence.

Warnings

Install

Imports

Quickstart

This quickstart writes records to an ArrayRecord file and reads them back with both the file-level API (`array_record_module`) and the multi-file API (`array_record_data_source`). It also shows where the `group_size` writer option is set: `group_size:1` favors random access, while larger groups favor sequential or batch reads and better compression.

import os

from array_record.python import array_record_module

# Define output path
output_file = 'output.array_record'

# --- Writing Records ---
# Use `group_size:1` for optimized random access; larger sizes improve sequential/batch access and compression.
writer = array_record_module.ArrayRecordWriter(output_file, 'group_size:1')
for i in range(10):
    data = f"Record {i} data".encode('utf-8')
    writer.write(data)
writer.close()
print(f"Wrote 10 records to {output_file}")

# --- Reading Records (File-level API) ---
reader = array_record_module.ArrayRecordReader(output_file)
print(f"Reading records from {output_file}:")
# num_records() is a method; read() with a list of indices returns a list of records
for i in range(reader.num_records()):
    record = reader.read([i])[0]
    print(f"  Record {i}: {record.decode('utf-8')}")
reader.close()

# --- Reading Records (Multi-file API with DataSource) ---
# Note: For DataSource, the file must have been written with the option string 'group_size:1'
from array_record.python import array_record_data_source

# In a real scenario, you'd have multiple files, e.g., ['file1.array_record', 'file2.array_record']
data_source = array_record_data_source.ArrayRecordDataSource([output_file])

print(f"Reading records using DataSource from {output_file}:")
for i in range(len(data_source)):
    record = data_source[i]
    print(f"  DataSource Record {i}: {record.decode('utf-8')}")

# Clean up the created file
os.remove(output_file)
print(f"Cleaned up {output_file}")
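
The quickstart reads one record per lookup. For batch access, a small helper can group indices and fetch each group in a single call. The sketch below is an illustration, not part of the library: `iter_batches` is a hypothetical helper, and it assumes the data source exposes a batched `__getitems__(indices)` method alongside `__len__` and `__getitem__`. When that method is absent it falls back to per-index lookups, so it also works on a plain list.

```python
from typing import Any, Iterator, List


def iter_batches(source: Any, batch_size: int) -> Iterator[List[Any]]:
    """Yield consecutive batches of records from an indexable source.

    `source` only needs `__len__` and `__getitem__`. If it also provides a
    batched `__getitems__(indices)` method (assumed here for
    ArrayRecordDataSource), that method is used for each batch.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    total = len(source)
    batched_lookup = getattr(source, "__getitems__", None)

    for start in range(0, total, batch_size):
        # The last batch may be shorter than batch_size.
        indices = list(range(start, min(start + batch_size, total)))
        if callable(batched_lookup):
            yield list(batched_lookup(indices))
        else:
            yield [source[i] for i in indices]


if __name__ == "__main__":
    # Stand-in for a data source: a plain list of encoded records.
    records = [f"Record {i} data".encode("utf-8") for i in range(5)]
    for batch in iter_batches(records, batch_size=2):
        print(batch)
```

With the quickstart's `data_source`, the same loop would be `for batch in iter_batches(data_source, 4): ...`, run before the file is removed.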
