NVIDIA Resiliency Extension
NVIDIA Resiliency Extension (NVRE) is a Python package that provides fault-tolerant features for framework developers and users, aiming to minimize downtime in deep learning training due to failures and interruptions. It supports features like checkpointing (local and cloud), in-job restarts, and health checks. The current version is 0.5.0, with minor releases occurring every few months to introduce new features and bug fixes.
Common errors
- ImportError: No module named 'nvre'
  - cause: The `nvidia-resiliency-ext` package is not installed in your Python environment, or not in the environment where your script is actually executed.
  - fix: Install the package using pip: `pip install nvidia-resiliency-ext`.
- RuntimeError: Resiliency manager not initialized. Call nvre.init_resiliency_manager() first.
  - cause: You are attempting to use an NVRE function (e.g., `has_checkpoint`, `save_checkpoint`) before calling `nvre.init_resiliency_manager()`.
  - fix: Ensure `nvre.init_resiliency_manager()` is called once at the start of your application, usually before any training loop or other NVRE-dependent logic.
- FileNotFoundError: [Errno 2] No such file or directory: '...' (when loading a checkpoint)
  - cause: The specified checkpoint ID or path does not correspond to an existing checkpoint on the local filesystem or the configured cloud storage.
  - fix: Verify that the `checkpoint_id` matches a previously saved checkpoint. Check the configured checkpoint storage location and ensure proper permissions. If using cloud storage, confirm the credentials and bucket/path are correct.
Warnings
- gotcha: `nvre.init_resiliency_manager()` must be called early in your application's execution, before any other NVRE feature is used. Failing to do so results in a `RuntimeError`.
- gotcha: For distributed training or specific resiliency backends (e.g., MPI), additional dependencies like `mpi4py` may be required; they are not direct dependencies of the `nvidia-resiliency-ext` package. Using these features without the corresponding packages installed leads to import errors or runtime failures.
- gotcha: Cloud checkpointing features (e.g., S3, GCS), introduced in v0.4.0, require the matching cloud provider SDK (e.g., `boto3` for AWS S3, `google-cloud-storage` for GCS). Without these, attempts to use cloud storage will fail.
Install
- `pip install nvidia-resiliency-ext`
Imports
- `import nvre`
Quickstart
```python
import nvre

def my_restart_callback(restart_args):
    print(f"[NVRE] Restart event received: {restart_args}")

# 1. Initialize the resiliency manager.
# This must be called early in your application's lifecycle.
# In a real scenario, this might be within a distributed setup like
# Horovod or PyTorch DDP. For simple testing, it can run standalone.
print("[NVRE] Initializing resiliency manager...")
nvre.init_resiliency_manager()

# 2. Register a callback for restart events (optional, but good practice)
nvre.register_restart_callback(my_restart_callback)

# 3. Example: checkpointing
checkpoint_id = "my_training_state"
if nvre.has_checkpoint(checkpoint_id):
    print(f"[NVRE] Loading checkpoint '{checkpoint_id}'...")
    state = nvre.load_checkpoint(checkpoint_id)
    current_step = state.get("step", 0)
    print(f"[NVRE] Resuming from step {current_step}")
else:
    print(f"[NVRE] No checkpoint found for '{checkpoint_id}'. Starting new training.")
    current_step = 0

# Simulate some training steps
for i in range(current_step, current_step + 3):
    print(f"[NVRE] Training step {i}")
    # Save a checkpoint after every step for demonstration;
    # "step" records the next step to resume from.
    state_to_save = {"step": i + 1, "model_config": {"lr": 0.001}}
    print(f"[NVRE] Saving checkpoint '{checkpoint_id}' at step {i}...")
    nvre.save_checkpoint(checkpoint_id, state_to_save)

print("[NVRE] Training finished.")

# Cleanup (optional in many cases, but good for explicit shutdown)
nvre.shutdown_resiliency_manager()
```
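To see the resume logic of the quickstart in action without NVRE installed, the save/load half can be simulated with a file-backed stand-in. `save_checkpoint` and `load_checkpoint` below are local sketches (JSON files in a temp directory), not the `nvre` functions:

```python
import json
import os
import tempfile

CKPT_DIR = tempfile.mkdtemp()  # stand-in for NVRE's configured storage

def save_checkpoint(checkpoint_id, state):
    # Write to a temp file, then rename: a crash mid-save never
    # leaves a truncated checkpoint behind (os.replace is atomic).
    path = os.path.join(CKPT_DIR, checkpoint_id + ".json")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(checkpoint_id):
    path = os.path.join(CKPT_DIR, checkpoint_id + ".json")
    with open(path) as f:
        return json.load(f)

def run(checkpoint_id, steps):
    # Same load-or-start logic as the quickstart.
    try:
        start = load_checkpoint(checkpoint_id)["step"]
    except FileNotFoundError:
        start = 0
    for i in range(start, start + steps):
        save_checkpoint(checkpoint_id, {"step": i + 1})
    return start

# The first "run" starts at 0; a second run resumes where the first stopped.
first = run("demo", 3)   # -> 0
second = run("demo", 3)  # -> 3
```

The atomic write (temp file plus `os.replace`) is a standard design choice for local checkpoint files: an interrupted save leaves the previous checkpoint intact rather than a half-written one.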