iden: Machine Learning Dataset Shard Manager
iden (version 0.4.0) is a Python library for managing datasets organized into shards for machine learning model training. It loads data from shards lazily, only when needed during training, which keeps memory usage low and the data pipeline efficient. The library aims to simplify handling of large or distributed datasets and is actively maintained with a focus on data management for ML applications.
Common errors
- FileNotFoundError: [Errno 2] No such file or directory: 'path/to/non_existent_shard.npy'
  cause: The `ShardDataset` was initialized with a path to a shard file that does not exist at the specified location. This often happens due to incorrect path configuration or data movement.
  fix: Verify that all paths passed to `ShardDataset` or `ShardLoader` refer to actual, accessible data files. Use absolute paths, or ensure your working directory is correctly set relative to the shard locations.
- IndexError: list index out of range (during iteration of ShardLoader)
  cause: This typically occurs when there is a mismatch between the expected number of items in a shard and what is actually available, or when the indexing logic of a custom `ShardDataset` implementation mishandles individual samples.
  fix: Review the `__len__` and `__getitem__` methods of your `ShardDataset` implementation. Ensure `__len__` accurately reports the total number of samples and `__getitem__` handles all valid indices without going out of bounds for any given shard.
- MemoryError: Unable to allocate ... (during batch processing)
  cause: Despite lazy loading at the shard level, the system can still run out of memory when loading a batch into RAM if individual samples are excessively large or `batch_size` is too high.
  fix: Reduce the `batch_size` in `ShardLoader`. Consider splitting very large samples into smaller units, or implement more granular lazy loading inside `ShardDataset.__getitem__` if individual samples are huge.
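The FileNotFoundError above can be caught early by validating every shard path before constructing the dataset. A minimal sketch of such a check (a hypothetical helper, not part of iden's API):

```python
import os

def validate_shard_paths(shard_paths):
    """Return the paths unchanged, or raise listing every missing shard file."""
    # Hypothetical helper: collects all missing paths instead of failing on the first.
    missing = [p for p in shard_paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Missing shard files: {missing}")
    return shard_paths
```

Running this before `ShardDataset(shard_paths)` surfaces every missing file at once instead of failing on the first one mid-training.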
Warnings
- gotcha Because loading is lazy, data transformations and heavy preprocessing should live in the dataset's `__getitem__` or in `ShardLoader`'s batching mechanism. Applying heavy computations globally before loading can negate the benefits of lazy loading.
- gotcha When handling a large number of shards, unique and persistent identifiers for each shard are critical for reproducibility, especially if shards are added, removed, or reordered. Relying solely on file-path order can lead to inconsistencies.
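One way to get persistent shard identifiers, as the second gotcha suggests, is to derive each shard's ID from its content rather than its position in the path list. Content hashing is my choice here as an illustration, not something iden prescribes:

```python
import hashlib

def shard_id(path, chunk_size=1 << 20):
    """Derive a stable identifier from a shard's bytes, independent of file order."""
    # Hash incrementally so arbitrarily large shards never load fully into memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()[:16]
```

Two runs over the same data produce the same IDs even if the path list is shuffled, and a renamed but unmodified shard keeps its ID.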
Install
-
pip install iden
Imports
- ShardDataset
from iden import ShardDataset
- ShardLoader
from iden import ShardLoader
Quickstart
import os
from iden import ShardDataset, ShardLoader
# Placeholder for creating dummy shards
def create_dummy_shards(base_path, num_shards=3, items_per_shard=10):
    os.makedirs(base_path, exist_ok=True)
    for i in range(num_shards):
        shard_file = os.path.join(base_path, f'shard_{i}.txt')
        with open(shard_file, 'w') as f:
            for j in range(items_per_shard):
                f.write(f'data_item_from_shard_{i}_idx_{j}\n')
    print(f'Created {num_shards} dummy shards in {base_path}')
# Setup (replace with your actual shard paths)
SHARD_BASE_PATH = './iden_data_shards'
create_dummy_shards(SHARD_BASE_PATH)
# 1. Initialize ShardDataset with paths to your data shards
# In a real scenario, this list would come from your data storage system
shard_paths = [os.path.join(SHARD_BASE_PATH, f'shard_{i}.txt') for i in range(3)]
dataset = ShardDataset(shard_paths)
# 2. Initialize ShardLoader for lazy loading
# batch_size and num_workers are typical parameters for data loading
loader = ShardLoader(dataset, batch_size=2, shuffle=True, num_workers=0)
# 3. Iterate through the data in batches
print("\nLoading data using ShardLoader:")
for epoch in range(2):
    print(f"--- Epoch {epoch + 1} ---")
    for i, batch in enumerate(loader):
        print(f"  Batch {i}: {batch}")
        if i >= 2:  # Limit output for quickstart
            break
    if epoch == 0:  # Ensure cleanup after first epoch for quickstart clarity
        import shutil
        shutil.rmtree(SHARD_BASE_PATH)
        print(f"Cleaned up dummy shards in {SHARD_BASE_PATH}")
        break  # Exit after one epoch for quickstart
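The lazy-loading pattern the Quickstart relies on can be illustrated independently of iden's internals. The toy class below is my own sketch, not iden code: it counts lines per shard up front to build a global index, but reads a shard's contents only when an index inside it is first requested, caching the most recently used shard.

```python
import bisect

class LazyTextShards:
    """Toy lazy dataset over text-line shards: a shard is read only on first access."""

    def __init__(self, shard_paths):
        self.shard_paths = list(shard_paths)
        # Count lines per shard once so we can map a global index to (shard, offset).
        counts = []
        for p in self.shard_paths:
            with open(p) as f:
                counts.append(sum(1 for _ in f))
        self.offsets = [0]
        for c in counts:
            self.offsets.append(self.offsets[-1] + c)
        self._cache = (None, None)  # (shard index, cached lines)

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find which shard holds this global index.
        s = bisect.bisect_right(self.offsets, idx) - 1
        if self._cache[0] != s:  # read the shard's lines only on a cache miss
            with open(self.shard_paths[s]) as f:
                self._cache = (s, f.read().splitlines())
        return self._cache[1][idx - self.offsets[s]]
```

Keeping `__len__` consistent with what `__getitem__` can actually serve is exactly the contract the IndexError entry above asks you to verify in a custom `ShardDataset`.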