{"id":2813,"library":"torchdata","title":"TorchData","description":"TorchData is a Python library of composable data loading modules for PyTorch. It extends `torch.utils.data.DataLoader` and `torch.utils.data.Dataset`/`IterableDataset` for scalable, performant data pipelines. Its main features are `StatefulDataLoader`, which adds checkpointing, and `torchdata.nodes`, which composes flexible data processing graphs. The current version is 0.11.0. After a period of re-evaluation, development has resumed with a focus on iterative enhancements to existing PyTorch data primitives.","status":"active","version":"0.11.0","language":"en","source_language":"en","source_url":"https://github.com/pytorch/data","tags":["pytorch","data-loading","machine-learning","deep-learning","etl","data-pipeline"],"install":[{"cmd":"pip install torchdata","lang":"bash","label":"Install with pip"}],"dependencies":[{"reason":"TorchData is built to extend PyTorch's data loading capabilities and is a core part of the PyTorch ecosystem.","package":"torch","optional":false}],"imports":[{"symbol":"StatefulDataLoader","correct":"from torchdata.stateful_dataloader import StatefulDataLoader"},{"note":"Represents the new direction for building data pipelines with composable iterators.","symbol":"nodes","correct":"import torchdata.nodes as nodes"},{"note":"DataPipes were deprecated in v0.8.0 and removed in v0.10.0. Migrate to `torchdata.nodes` or `StatefulDataLoader`.","wrong":"from torchdata.datapipes.iter import ...","symbol":"DataPipes","correct":"from torchdata.nodes import IterableWrapper"},{"note":"DataLoader2 was deprecated in v0.8.0 and removed in v0.10.0. 
Use `StatefulDataLoader` or standard `torch.utils.data.DataLoader` instead.","wrong":"from torchdata.dataloader2 import DataLoader2","symbol":"DataLoader2","correct":"from torchdata.stateful_dataloader import StatefulDataLoader"}],"quickstart":{"code":"import torch\nfrom torch.utils.data import TensorDataset\nfrom torchdata.stateful_dataloader import StatefulDataLoader\n\n# Create a dummy dataset: 100 samples with 10 features and binary labels\ndata = torch.randn(100, 10)\nlabels = torch.randint(0, 2, (100,))\ndataset = TensorDataset(data, labels)\n\n# Use StatefulDataLoader as a drop-in replacement for torch.utils.data.DataLoader\nbatch_size = 16\ndataloader = StatefulDataLoader(\n    dataset,\n    batch_size=batch_size,\n    shuffle=True,\n    num_workers=0,  # For simplicity, load in the main process\n)\n\nprint(f\"Number of batches: {len(dataloader)}\")\n\n# Consume a few batches, then checkpoint in the middle of the epoch\nstate = None\nfor i, (batch_data, batch_labels) in enumerate(dataloader):\n    print(f\"  Batch {i}: data_shape={batch_data.shape}, labels_shape={batch_labels.shape}\")\n    # In a real scenario, perform a training step here\n    if i == 2:\n        # state_dict() captures the loader's position after three batches\n        state = dataloader.state_dict()\n        print(f\"\\nSaved dataloader state: {list(state.keys())}\")\n        break\n\n# Simulate a restart: build a fresh loader and restore the saved state\nnew_dataloader = StatefulDataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=0)\nnew_dataloader.load_state_dict(state)\nprint(\"Loaded dataloader state.\")\n\n# Iteration resumes with the batches that remain in the interrupted epoch\nprint(\"Resuming iteration from the saved state:\")\nfor i, (batch_data, batch_labels) in enumerate(new_dataloader):\n    print(f\"  Resumed batch {i}: data_shape={batch_data.shape}, labels_shape={batch_labels.shape}\")\n","lang":"python","description":"This quickstart demonstrates how to use `StatefulDataLoader`, which is a key enhancement of 
`torch.utils.data.DataLoader` provided by TorchData. It is a drop-in replacement that adds checkpointing."},"warnings":[{"fix":"Migrate existing pipelines to `torchdata.nodes`, or use `StatefulDataLoader` as an enhancement to `torch.utils.data.DataLoader`. If you must use DataPipes/DataLoader2, pin your dependency to `torchdata<=0.9.0`.","message":"DataPipes and DataLoader2, core components of earlier TorchData versions, were marked as deprecated in v0.8.0 and removed in v0.10.0. Later releases, including 0.11.0, neither include nor maintain them.","severity":"breaking","affected_versions":">=0.10.0"},{"fix":"Upgrade your Python environment to version 3.9 or newer. The current release (0.11.0) requires Python >=3.9.","message":"Python 3.8 support was dropped in TorchData v0.9.0.","severity":"breaking","affected_versions":">=0.9.0"},{"fix":"Install TorchData via pip from PyPI: `pip install torchdata`.","message":"TorchData has deprecated and removed its conda builds, as PyTorch's official conda channel itself is deprecated.","severity":"deprecated","affected_versions":"All versions"},{"fix":"Refer to the official documentation for `StatefulDataLoader` regarding checkpointing and worker/seed management to ensure deterministic and correct resumption of training. Test your checkpointing logic thoroughly.","message":"Be aware of specific behaviors in `StatefulDataLoader` related to `num_workers=0` and initial seeding for `RandomSampler` during state loading. These can lead to unexpected iteration patterns if not handled carefully.","severity":"gotcha","affected_versions":"0.8.0, 0.11.0 (and potentially others)"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}