iden: Machine Learning Dataset Shard Manager

0.4.0 · active · verified Thu Apr 16

iden (version 0.4.0) is a lightweight Python library for managing datasets organized into shards for machine learning training. It uses lazy loading, reading data from a shard only when it is needed during training, which keeps memory usage low and the data pipeline efficient. The library aims to simplify handling of large or distributed datasets and is actively maintained, with a focus on data management for ML applications.

Common errors

Warnings

Install
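Assuming the package is published on PyPI under the name `iden` (an assumption; check the project's own documentation if the name differs), installation is the usual:

```shell
pip install iden
```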

Imports

Quickstart

The quickstart initializes a `ShardDataset` with paths to data shards, then uses a `ShardLoader` to iterate over those shards in batches. The `ShardLoader` loads lazily, fetching data only as required, which matters when training on datasets too large to hold in memory. A small helper creates dummy shards for demonstration purposes.

import os
from iden import ShardDataset, ShardLoader

# Placeholder for creating dummy shards
def create_dummy_shards(base_path, num_shards=3, items_per_shard=10):
    os.makedirs(base_path, exist_ok=True)
    for i in range(num_shards):
        shard_file = os.path.join(base_path, f'shard_{i}.txt')
        with open(shard_file, 'w') as f:
            for j in range(items_per_shard):
                f.write(f'data_item_from_shard_{i}_idx_{j}\n')
    print(f'Created {num_shards} dummy shards in {base_path}')

# Setup (replace with your actual shard paths)
SHARD_BASE_PATH = './iden_data_shards'
create_dummy_shards(SHARD_BASE_PATH)

# 1. Initialize ShardDataset with paths to your data shards
# In a real scenario, this list would come from your data storage system
shard_paths = [os.path.join(SHARD_BASE_PATH, f'shard_{i}.txt') for i in range(3)]

dataset = ShardDataset(shard_paths)

# 2. Initialize ShardLoader for lazy loading
# batch_size and num_workers are typical parameters for data loading
loader = ShardLoader(dataset, batch_size=2, shuffle=True, num_workers=0)

# 3. Iterate through the data in batches
print("\nLoading data using ShardLoader:")
for i, batch in enumerate(loader):
    print(f"  Batch {i}: {batch}")
    if i >= 2:  # limit output for the quickstart
        break

# 4. Clean up the dummy shards once iteration is done
import shutil
shutil.rmtree(SHARD_BASE_PATH)
print(f"Cleaned up dummy shards in {SHARD_BASE_PATH}")

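The lazy-loading idea behind `ShardLoader` can be illustrated with a minimal, library-independent sketch. The generator below is illustrative only, not iden's actual implementation: each shard file is opened only when iteration reaches it, so at most one shard's contents pass through memory at a time.

```python
import os
import tempfile

def iter_shards_lazily(shard_paths, batch_size):
    """Yield fixed-size batches, opening each shard file only when reached."""
    batch = []
    for path in shard_paths:
        with open(path) as f:  # the shard is opened only at this point
            for line in f:
                batch.append(line.strip())
                if len(batch) == batch_size:
                    yield batch
                    batch = []
    if batch:  # flush the final partial batch
        yield batch

# Usage with two tiny shards written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        p = os.path.join(tmp, f"shard_{i}.txt")
        with open(p, "w") as f:
            f.write(f"a{i}\nb{i}\n")
        paths.append(p)
    print(list(iter_shards_lazily(paths, batch_size=3)))
    # [['a0', 'b0', 'a1'], ['b1']]
```

Because the generator yields batches as it goes, training can start on the first shard before later shards are ever touched.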
