{"id":2391,"library":"array-record","title":"ArrayRecord","description":"ArrayRecord is a high-performance file format derived from Riegeli, designed for machine learning workloads. It improves I/O efficiency by supporting parallel reads, parallel writes, and random access by record index. The current library version is 0.8.3.","status":"active","version":"0.8.3","language":"en","source_language":"en","source_url":"https://github.com/google/array_record","tags":["data storage","io","performance","machine learning","record format","riegeli"],"install":[{"cmd":"pip install array-record","lang":"bash","label":"Basic Installation"},{"cmd":"pip install array-record[beam]","lang":"bash","label":"For Apache Beam Integration"}],"dependencies":[{"reason":"Required runtime dependency for core functionality.","package":"absl-py","optional":false},{"reason":"Required runtime dependency for core functionality.","package":"etils","optional":false},{"reason":"Optional dependency for Apache Beam integration.","package":"apache-beam","optional":true},{"reason":"Optional dependency for integration with the Grain data loading library.","package":"grain","optional":true}],"imports":[{"note":"Provides file-level read/write APIs, with a one-to-one mapping to the underlying C++ API.","symbol":"array_record_module","correct":"from array_record.python import array_record_module"},{"note":"Wraps `array_record_module` for convenient access to multiple ArrayRecord files.","symbol":"array_record_data_source","correct":"from array_record.python import array_record_data_source"}],"quickstart":{"code":"from array_record.python import array_record_module\nimport os\n\n# Define output path\noutput_file = 'output.array_record'\n\n# --- Writing Records ---\n# Use `group_size:1` for optimized random access; larger sizes improve sequential/batch access and compression.\nwriter = array_record_module.ArrayRecordWriter(output_file, 
'group_size:1')\nfor i in range(10):\n    data = f\"Record {i} data\".encode('utf-8')\n    writer.write(data)\nwriter.close()\nprint(f\"Wrote 10 records to {output_file}\")\n\n# --- Reading Records (File-level API) ---\nreader = array_record_module.ArrayRecordReader(output_file)\nprint(f\"Reading records from {output_file}:\")\n# num_records is a method, so it must be called\nfor i in range(reader.num_records()):\n    # read() takes a list of indices and returns a list of records\n    record = reader.read([i])[0]\n    print(f\"  Record {i}: {record.decode('utf-8')}\")\nreader.close()\n\n# --- Reading Records (Multi-file API with DataSource) ---\n# Note: For DataSource, the files must be written with the options string 'group_size:1'\nfrom array_record.python import array_record_data_source\n\n# In a real scenario, you'd have multiple files, e.g., ['file1.array_record', 'file2.array_record']\ndata_source = array_record_data_source.ArrayRecordDataSource([output_file])\n\nprint(f\"Reading records using DataSource from {output_file}:\")\nfor i in range(len(data_source)):\n    record = data_source[i]\n    print(f\"  DataSource Record {i}: {record.decode('utf-8')}\")\n\n# Clean up the created file\nos.remove(output_file)\nprint(f\"Cleaned up {output_file}\")","lang":"python","description":"This quickstart writes records to an ArrayRecord file and reads them back using both the file-level API (`array_record_module`) and the multi-file API (`array_record_data_source`). It also shows how the `group_size` option chosen at write time affects read patterns.","warnings":[{"fix":"To read multiple records at once, use the `__getitems__()` method introduced in `v0.4.0`. For single-record access, pass a single integer index.","message":"The `__getitem__` method signature changed in `v0.4.0`. It now strictly accepts a single integer index and returns a single record, aligning with Python's standard `__getitem__` behavior. 
Batching multiple indices is no longer supported directly via `__getitem__`.","severity":"breaking","affected_versions":">=0.4.0"},{"fix":"Pass a single integer index to each `__getitem__` call. For batch reads, iterate over single-index accesses, or upgrade to `v0.4.0` or later and use `__getitems__()`.","message":"Before `v0.4.0`, the way `__getitem__` handled batching was in transition: in `v0.3.0`, batching was no longer *required* for good performance but was still handled implicitly, ahead of the strict single-item access introduced in `v0.4.0`. Code that relies on `__getitem__` implicitly handling lists of indices will break.","severity":"breaking","affected_versions":">=0.3.0, <0.4.0"},{"fix":"Always pass the options string `'group_size:1'` to `ArrayRecordWriter` if you intend to read the files through `ArrayRecordDataSource` with random access patterns.","message":"When using `array_record_data_source` for multi-file and random access, the ArrayRecord files must have been written with the `ArrayRecordWriter` options string `'group_size:1'`. Otherwise, the data source may not function as expected or may be inefficient.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure your Python environment is version 3.11 or higher. Check the `Requires-Python` metadata on PyPI for the specific version you are installing.","message":"Current versions of ArrayRecord require Python 3.11 or newer. Older versions (e.g., 0.2.0) supported Python 3.8+.","severity":"gotcha","affected_versions":">=0.6.0 (approximate, current versions)"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}