NVIDIA Resiliency Extension

0.5.0 · active · verified Thu Apr 16

NVIDIA Resiliency Extension (NVRE) is a Python package that provides fault-tolerance features for framework developers and users. Its goal is to minimize the training time lost to failures and interruptions in deep learning jobs. Features include checkpointing (local and cloud), in-job restarts, and health checks. The current version is 0.5.0; minor releases with new features and bug fixes arrive every few months.

Quickstart

This quickstart initializes the resiliency manager, registers a restart callback, and uses checkpointing to save and load training state. It simulates a few training steps with a checkpoint save after each one, so that running the script again resumes from the previously saved step.

import nvre

def my_restart_callback(restart_args):
    """Log the arguments passed along with a restart event."""
    print(f"[NVRE] Restart event received: {restart_args}")

# 1. Initialize the resiliency manager
# This must be called early in your application's lifecycle.
# In a real scenario, this might be within a distributed setup like Horovod or PyTorch DDP.
# For simple testing, it can run standalone.
print("[NVRE] Initializing resiliency manager...")
nvre.init_resiliency_manager()

# 2. Register a callback for restart events (optional, but good practice)
nvre.register_restart_callback(my_restart_callback)

# 3. Example: Checkpointing
checkpoint_id = "my_training_state"

if nvre.has_checkpoint(checkpoint_id):
    print(f"[NVRE] Loading checkpoint '{checkpoint_id}'...")
    state = nvre.load_checkpoint(checkpoint_id)
    current_step = state.get("step", 0)
    print(f"[NVRE] Resuming from step {current_step}")
else:
    print(f"[NVRE] No checkpoint found for '{checkpoint_id}'. Starting new training.")
    current_step = 0

# Simulate some training steps
for i in range(current_step, current_step + 3):
    print(f"[NVRE] Training step {i}")
    # Save a checkpoint after every step for demonstration.
    # The saved step is i + 1, so a resumed run starts at the next step.
    state_to_save = {"step": i + 1, "model_config": {"lr": 0.001}}
    print(f"[NVRE] Saving checkpoint '{checkpoint_id}' after step {i}...")
    nvre.save_checkpoint(checkpoint_id, state_to_save)

print("[NVRE] Training finished.")
# Cleanup (optional in many cases, but good for explicit shutdown)
nvre.shutdown_resiliency_manager()
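
The resume behaviour in the quickstart only becomes visible when the script runs twice. The sketch below reproduces the same check, load, and save pattern in a single process, using a hypothetical file-based stand-in (`FileCheckpointStore`, `train`, and `demo` are illustrative names, not part of NVRE) and a simulated failure partway through the first run. It is a minimal illustration of the pattern under those assumptions, not a description of how the library stores checkpoints.

```python
import json
import os
import tempfile


class FileCheckpointStore:
    """Hypothetical stand-in for a checkpoint backend; not part of NVRE."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, checkpoint_id):
        return os.path.join(self.root, f"{checkpoint_id}.json")

    def has(self, checkpoint_id):
        return os.path.exists(self._path(checkpoint_id))

    def load(self, checkpoint_id):
        with open(self._path(checkpoint_id), "r", encoding="utf-8") as f:
            return json.load(f)

    def save(self, checkpoint_id, state):
        # Write to a temp file in the same directory, then rename it into
        # place, so an interruption mid-write never leaves a truncated file.
        fd, tmp_path = tempfile.mkstemp(dir=self.root, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as f:
                json.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp_path, self._path(checkpoint_id))
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise


def train(store, checkpoint_id, total_steps, fail_at=None):
    """Run up to total_steps, resuming from the stored step if one exists.

    Returns the list of step indices executed during this call.
    """
    start = store.load(checkpoint_id)["step"] if store.has(checkpoint_id) else 0
    executed = []
    for step in range(start, total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated failure at step {step}")
        executed.append(step)
        # Store the index of the next step to run, as in the quickstart.
        store.save(checkpoint_id, {"step": step + 1})
    return executed


def demo():
    """Interrupt a first run at step 3, then resume and finish a second run."""
    with tempfile.TemporaryDirectory() as root:
        store = FileCheckpointStore(root)
        try:
            train(store, "my_training_state", total_steps=5, fail_at=3)
        except RuntimeError:
            pass  # steps 0, 1, and 2 were checkpointed before the failure
        return train(store, "my_training_state", total_steps=5)


if __name__ == "__main__":
    print(demo())  # [3, 4]
```

Saving the index of the next step rather than the step just completed means a resumed run never repeats work that was already checkpointed, which is the same convention the quickstart uses with `"step": i + 1`.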
