LitData
A high-performance data processing library for AI workflows, part of the Lightning AI ecosystem. It provides optimized streaming datasets and data loaders for training deep learning models. Current version: 0.2.61 (verified Mon Apr 27). Under active development with frequent releases.
pip install litdata

Common errors
error FileNotFoundError: No such file or directory
cause The input directory does not contain properly formatted chunk files, or the path is incorrect.
fix Preprocess your data with `from litdata import optimize; optimize(...)` to create chunks. Ensure `input_dir` points to a directory containing the generated `.bin` chunk files and an `index.json`.

error ModuleNotFoundError: No module named 'lightning'
cause Importing from the old package name `lightning` instead of `litdata`.
fix Use `from litdata import StreamingDataset` instead of `from lightning.data import StreamingDataset`.

Warnings
breaking In v0.2.55, writing compressed data to Lightning Storage directories was fixed; earlier versions could produce broken output. Upgrade to >=0.2.55 if writing compressed data.
fix pip install "litdata>=0.2.55" (quote the requirement so the shell does not treat `>` as a redirect)

deprecated The `LightningDataset` class may be deprecated in future versions in favor of `StreamingDataset`. Check the release notes for migration guidance.
fix Use `StreamingDataset` directly.

gotcha `StreamingDataset` expects a specific directory structure. If you pass a path without properly chunked files, it may raise `FileNotFoundError` or hang. Always preprocess data with the `optimize` function first.
fix Use `optimize` from litdata to convert raw data into the chunked format before streaming.
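For intuition about why unchunked directories fail, the layout a chunked dataset uses (binary chunk files plus an index describing them) can be mimicked in plain Python. This is an illustrative sketch only, not litdata's actual on-disk format or implementation:

```python
import json
import os
import tempfile

def write_chunks(samples, output_dir, chunk_size=2):
    # Illustrative only: group samples into fixed-size binary "chunk"
    # files and write an index.json describing them, loosely mirroring
    # the structure a streaming reader expects to find.
    os.makedirs(output_dir, exist_ok=True)
    index = []
    for i in range(0, len(samples), chunk_size):
        group = samples[i:i + chunk_size]
        name = f"chunk-{i // chunk_size}.bin"
        with open(os.path.join(output_dir, name), "wb") as f:
            f.write(b"".join(s.encode() + b"\n" for s in group))
        index.append({"filename": name, "num_samples": len(group)})
    with open(os.path.join(output_dir, "index.json"), "w") as f:
        json.dump(index, f)
    return index

out = tempfile.mkdtemp()
index = write_chunks(["a", "b", "c"], out)
```

A reader pointed at a directory without such chunk files and index has nothing to stream, which is the failure mode the gotcha above describes.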
Imports
- StreamingDataset
  wrong: `from lightning.data import StreamingDataset`
  correct: `from litdata import StreamingDataset`
- StreamingDataLoader
  wrong: `from litdata.streaming import StreamingDataLoader`
  correct: `from litdata import StreamingDataLoader`
- optimize
  wrong: `from litdata.processing import optimize`
  correct: `from litdata import optimize`
- LightningDataset
  `from litdata import LightningDataset`
Quickstart
```python
from litdata import StreamingDataset, StreamingDataLoader

# Create a simple streaming dataset
class MyDataset(StreamingDataset):
    def __init__(self):
        super().__init__(input_dir="s3://my-bucket/data", shuffle=True)

dataset = MyDataset()
dataloader = StreamingDataLoader(dataset, batch_size=32)

for batch in dataloader:
    print(batch)
    break
```
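Conceptually, a streaming loader shuffles the sample order and yields fixed-size batches. The sketch below imitates that loop in plain Python as an illustration; it is not litdata's implementation, and the function name is hypothetical:

```python
import random

def iter_batches(samples, batch_size, shuffle=True, seed=42):
    # Sketch of a loader loop: optionally shuffle indices with a fixed
    # seed, then yield lists of `batch_size` samples (the last batch
    # may be smaller).
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

batches = list(iter_batches(list(range(10)), batch_size=4))
```

Every sample appears exactly once per pass, just in shuffled order, which is what `shuffle=True` requests in the quickstart above.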