Megatron-Energon
version 7.3.2, verified Fri May 01, auth: no, python
NVIDIA Megatron-Energon is a multi-modal data loader library for large-scale deep learning, particularly for training large language models (LLMs) and vision-language models. It supports tar-based WebDataset datasets, JSONL files, and polylithic datasets, with features such as caching, AV decoding, and FUSE mounting. The current version is 7.3.2; the project is under active development, with releases roughly monthly.
pip install megatron-energon
Common errors
error ModuleNotFoundError: No module named 'energon'
cause Attempting to import directly from 'energon' instead of 'megatron.energon'.
fix Use `from megatron.energon import ...`
error sqlite3.OperationalError: locking protocol
cause Concurrent SQLite access in cache database without proper retry logic.
fix Upgrade to >=7.2.2, which includes a fix for this locking issue.
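If upgrading is not immediately possible and your own code touches a shared SQLite file, a retry wrapper is one way to ride out transient locking errors. The sketch below is generic standard-library Python, not Energon's internal fix; `run_with_retry` and its parameters are hypothetical names chosen for illustration.

```python
import sqlite3
import time


def run_with_retry(operation, retries=5, base_delay=0.05):
    """Call operation() and retry on sqlite3.OperationalError.

    Waits base_delay * 2**attempt seconds between attempts and re-raises
    the last error once all attempts are used up.
    """
    for attempt in range(retries):
        try:
            return operation()
        except sqlite3.OperationalError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


# Usage: wrap any callable that performs the database access.
conn = sqlite3.connect(":memory:")
row = run_with_retry(lambda: conn.execute("SELECT 1").fetchone())
conn.close()
```

Exponential backoff keeps competing processes from retrying in lockstep; the retry count and delay here are arbitrary starting points, not tuned values.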
error ValueError: Do not know how to load sample, given the available subflavors
cause The dataset metadata does not match the sample type you are requesting.
fix Ensure your dataset's .nvs file includes the correct subflavors (e.g., 'image', 'video', 'audio').
Warnings
breaking Version 7.0.0 introduced polylithic datasets and a new AVDecoder, which breaks datasets created with v6.x. Existing datasets may need to be re-prepared.
fix Run `energon prepare` again, or stay on an older version (6.0.1) if you must keep the legacy format.
breaking Version 6.0.0 changed the save/restore mechanism: the worker dimension is now the outer dimension. This breaks checkpoint compatibility.
fix Implement the new save/restore logic as described in the updated docs; old checkpoints are not compatible.
gotcha The `disable_cache` option (for ITarReader) is only available from 7.3.2 onward. Before that, caching may cause thread-safety issues.
fix Upgrade to >=7.3.2 or avoid concurrent tar readers.
deprecated fsspec was replaced by EPath in 6.0.1. Code relying on fsspec paths may break.
fix Update to EPath paths as described in the migration guide.
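Several of the fixes above depend on the installed version (>=7.2.2, >=7.3.2), so a start-up check can make a mismatch visible early. The following is a minimal sketch using only the standard library; the helper names `parse_release`, `meets_minimum`, and `installed_version` are hypothetical, and the parser only looks at the leading numeric release components (it ignores pre-release and dev suffixes).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v):
    """Turn '7.3.2' or '7.3.2rc1' into (7, 3, 2); stops at the first non-numeric part."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare release tuples, padding with zeros so '7.3' equals '7.3.0'."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_version(dist_name):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Usage: warn if the locking fix mentioned above (>=7.2.2) is missing.
found = installed_version("megatron-energon")
if found is None:
    print("megatron-energon is not installed")
elif not meets_minimum(found, "7.2.2"):
    print(f"megatron-energon {found} is older than 7.2.2")
```

For full PEP 440 semantics (pre-releases, epochs, local versions), a dedicated version parser would be a better fit than this simplified comparison.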
Install
pip install megatron-energon[all]
Imports
- get_train_dataset
  wrong: from energon import get_train_dataset
  correct: from megatron.energon import get_train_dataset
- WorkerConfig
  wrong: from energon import WorkerConfig
  correct: from megatron.energon import WorkerConfig
- Webdataset
  from megatron.energon import Webdataset
Quickstart
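import os
from megatron.energon import WorkerConfig, get_loader, get_train_dataset

# Describe this process: a single rank with two data-loading workers
worker_config = WorkerConfig(
    rank=0,
    world_size=1,
    num_workers=2,
)

# Build the training dataset from a prepared dataset directory
dataset = get_train_dataset(
    path=os.environ.get('DATASET_PATH', 'path/to/dataset'),
    worker_config=worker_config,
    batch_size=4,
    shuffle_buffer_size=100,
    max_samples_per_sequence=100,
)

# Wrap the dataset in a loader before iterating
loader = get_loader(dataset)

for batch in loader:
    # The batch layout depends on the dataset's sample type;
    # inspect it before relying on specific fields
    print(type(batch))
    break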