{"id":6723,"library":"mosaicml-streaming","title":"MosaicML Streaming","description":"MosaicML Streaming (StreamingDataset) provides PyTorch-compatible datasets that can be efficiently streamed from cloud-based object stores (S3, GCS, Azure Blob Storage, Hugging Face Hub) or local filesystems. It enables training on large datasets without needing to download them entirely beforehand, improving data loading performance and reducing storage costs. The library is actively maintained with frequent updates, currently at version 0.13.0.","status":"active","version":"0.13.0","language":"en","source_language":"en","source_url":"https://github.com/mosaicml/streaming/","tags":["pytorch","data-streaming","cloud","dataset","object-storage","mlops"],"install":[{"cmd":"pip install mosaicml-streaming","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required for PyTorch compatibility and DataLoader integration.","package":"torch","optional":false}],"imports":[{"symbol":"StreamingDataset","correct":"from streaming import StreamingDataset"},{"note":"Used for creating MosaicML Streaming (MDS) datasets.","symbol":"MDSWriter","correct":"from streaming import MDSWriter"}],"quickstart":{"code":"import os\nimport torch\nfrom streaming import StreamingDataset\nfrom torch.utils.data import DataLoader\nimport json\n\n# Define local paths for quickstart demonstration\n# In a real scenario, 'remote' would point to your cloud MDS dataset\n# (e.g., \"s3://my-bucket/data\" or \"gs://my-bucket/data\").\n# Ensure cloud credentials are set in environment variables for cloud remotes.\nlocal_remote_path = \"quickstart_mds_data\"\nlocal_cache_path = \"quickstart_mds_cache\"\n\n# --- Create a dummy MDS dataset for local testing if it doesn't exist ---\n# For actual use, you'd generate MDS datasets with `streaming.MDSWriter`\n# or point to existing ones in cloud storage.\nif not os.path.exists(local_remote_path):\n    print(f\"Creating dummy MDS data in 
'{local_remote_path}'...\")\n    # Local import: MDSWriter is only needed to build the demo dataset.\n    from streaming import MDSWriter\n    # Each sample is a dict whose keys and encodings match this column schema.\n    columns = {\"id\": \"int\", \"text\": \"str\"}\n    # MDSWriter creates the output directory and writes the shard files plus index.json.\n    with MDSWriter(out=local_remote_path, columns=columns) as writer:\n        for i in range(4):\n            writer.write({\"id\": i, \"text\": f\"sample {i}\"})\n    print(\"Dummy MDS data created.\")\nelse:\n    print(f\"Using existing dummy MDS data in '{local_remote_path}'.\")\n\nos.makedirs(local_cache_path, exist_ok=True)\n# --- End of dummy MDS creation ---\n\n# 1. Initialize the StreamingDataset\ndataset = StreamingDataset(\n    local=local_cache_path,  # Local cache directory for downloaded shards\n    remote=local_remote_path, # Path to your MDS dataset (local or cloud)\n    shuffle=True,\n    batch_size=1, # Per-device batch size; keep it equal to the DataLoader batch size\n    # Other parameters like `predownload` can be tuned for performance\n)\n\n# 2. Create a PyTorch DataLoader\ndataloader = DataLoader(\n    dataset=dataset,\n    batch_size=1, # DataLoader batch size\n    num_workers=0, # Use 0 workers for simple local testing to avoid multiprocessing issues\n)\n\n# 3. 
Iterate over the data\nprint(f\"Dataset has {len(dataset)} samples.\")\nfor i, batch in enumerate(dataloader):\n    # Each batch is a dict keyed by column name (here 'id' and 'text'),\n    # collated by the DataLoader.\n    print(f\"Batch {i}: {batch}\")\n    if i >= 1: # Stop after two batches for demonstration\n        break\n\n# Note: For production use, remember to configure cloud credentials\n# (e.g., via environment variables like AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,\n# or cloud provider CLI configs) if 'remote' points to cloud storage.","lang":"python","description":"This quickstart demonstrates how to initialize `StreamingDataset` and integrate it with `torch.utils.data.DataLoader`. It writes a minimal local MDS dataset with `MDSWriter` for immediate testing. For cloud usage, ensure the `remote` parameter points to your cloud object storage path and that necessary cloud provider credentials are correctly configured in your environment."},"warnings":[{"fix":"Upgrade Python environment to 3.10 or later.","message":"Python 3.9 support was deprecated in `v0.12.0`. Users on Python 3.9 must upgrade their Python version to 3.10 or higher (3.12+ is fully supported) to use `mosaicml-streaming` versions 0.12.0 and above.","severity":"breaking","affected_versions":">=0.12.0"},{"fix":"Ensure environment variables (e.g., `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AZURE_STORAGE_ACCOUNT_NAME`, `AZURE_STORAGE_ACCOUNT_KEY`, `HF_TOKEN`) or cloud provider CLI configurations are correctly set for the target remote storage. Consult cloud provider documentation for specific setup.","message":"Proper authentication/credentials are critical for streaming from cloud object stores (S3, GCS, Azure Blob, HF Hub). 
Incorrectly configured credentials are a common source of errors.","severity":"gotcha","affected_versions":"All"},{"fix":"Upgrade `mosaicml-streaming` to version 0.8.1 or newer to benefit from the fix.","message":"Earlier versions (<0.8.1) could experience dataloader hangs between epochs, significantly impacting training time. This issue was resolved in v0.8.1.","severity":"gotcha","affected_versions":"<0.8.1"},{"fix":"Upgrade `mosaicml-streaming` to version 0.10.0 or newer to utilize reusable cloud download clients and improve stability.","message":"Prior to v0.10.0, the library created a new cloud client for each download, potentially leading to 'too many open sockets' errors or excessive cloud authentication requests. Version 0.10.0 introduced client reuse.","severity":"gotcha","affected_versions":"<0.10.0"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}