{"id":8364,"library":"nvidia-resiliency-ext","title":"NVIDIA Resiliency Extension","description":"NVIDIA Resiliency Extension (NVRE) is a Python package that provides fault-tolerant features for framework developers and users, aiming to minimize downtime in deep learning training due to failures and interruptions. It supports features like checkpointing (local and cloud), in-job restarts, and health checks. The current version is 0.5.0, with minor releases occurring every few months to introduce new features and bug fixes.","status":"active","version":"0.5.0","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/nvidia-resiliency-ext","tags":["NVIDIA","GPU","deep-learning","fault-tolerance","resilience","checkpointing","distributed-training","MLOps"],"install":[{"cmd":"pip install nvidia-resiliency-ext","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Required for core functionalities.","package":"numpy","optional":false},{"reason":"Required for MPI-based resiliency features and distributed training contexts.","package":"mpi4py","optional":true},{"reason":"Often used for data serialization, especially in distributed contexts.","package":"protobuf","optional":true},{"reason":"Required for S3 cloud checkpointing support.","package":"boto3","optional":true},{"reason":"Required for Google Cloud Storage checkpointing support.","package":"google-cloud-storage","optional":true}],"imports":[{"note":"The primary module for accessing all NVIDIA Resiliency Extension functionalities.","symbol":"nvre","correct":"import nvre"}],"quickstart":{"code":"import nvre\nimport os\n\ndef my_restart_callback(restart_args):\n    print(f\"[NVRE] Restart event received: {restart_args}\")\n\n# 1. 
Initialize the resiliency manager\n# This must be called early in your application's lifecycle.\n# In a real scenario, this might be within a distributed setup like Horovod or PyTorch DDP.\n# For simple testing, it can run standalone.\nprint(\"[NVRE] Initializing resiliency manager...\")\nnvre.init_resiliency_manager()\n\n# 2. Register a callback for restart events (optional, but good practice)\nnvre.register_restart_callback(my_restart_callback)\n\n# 3. Example: Checkpointing\ncheckpoint_id = \"my_training_state\"\n\nif nvre.has_checkpoint(checkpoint_id):\n    print(f\"[NVRE] Loading checkpoint '{checkpoint_id}'...\")\n    state = nvre.load_checkpoint(checkpoint_id)\n    current_step = state.get(\"step\", 0)\n    print(f\"[NVRE] Resuming from step {current_step}\")\nelse:\n    print(f\"[NVRE] No checkpoint found for '{checkpoint_id}'. Starting new training.\")\n    current_step = 0\n\n# Simulate some training steps\nfor i in range(current_step, current_step + 3):\n    print(f\"[NVRE] Training step {i}\")\n    # Save a checkpoint after every step for demonstration;\n    # the stored step is the next one to run on resume.\n    state_to_save = {\"step\": i + 1, \"model_config\": {\"lr\": 0.001}}\n    print(f\"[NVRE] Saving checkpoint '{checkpoint_id}' at step {i}...\")\n    nvre.save_checkpoint(checkpoint_id, state_to_save)\n\nprint(\"[NVRE] Training finished.\")\n# Cleanup (optional in many cases, but good for explicit shutdown)\nnvre.shutdown_resiliency_manager()","lang":"python","description":"This quickstart demonstrates the basic initialization of the resiliency manager, registering a restart callback, and using checkpointing to save and load training state. 
The example simulates training steps and checkpoint saves, showing how to resume from a previously saved state."},"warnings":[{"fix":"Ensure `nvre.init_resiliency_manager()` is the first NVRE call in your main execution path.","message":"The `nvre.init_resiliency_manager()` function must be called before any other NVRE feature is used. Calling another NVRE function first raises a `RuntimeError`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install necessary optional dependencies for your chosen resiliency backend, e.g., `pip install mpi4py` or `pip install boto3` for cloud checkpointing.","message":"For distributed training or specific resiliency backends (e.g., MPI), additional packages such as `mpi4py` may be required but are not installed automatically with `nvidia-resiliency-ext`. Using these features without the corresponding packages installed will lead to import errors or runtime failures.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install the appropriate SDK for your cloud provider: `pip install boto3` for AWS S3, `pip install google-cloud-storage` for GCS.","message":"Cloud checkpointing features (e.g., S3, GCS) introduced in v0.4.0 require installing specific cloud provider SDKs (e.g., `boto3` for AWS S3, `google-cloud-storage` for GCS). 
Without these, attempts to use cloud storage will fail.","severity":"gotcha","affected_versions":">=0.4.0"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Install the package using pip: `pip install nvidia-resiliency-ext`.","cause":"The `nvidia-resiliency-ext` package is not installed in the Python environment that runs your script.","error":"ModuleNotFoundError: No module named 'nvre'"},{"fix":"Ensure `nvre.init_resiliency_manager()` is called once at the start of your application, before any training loop or other NVRE-dependent logic.","cause":"You are attempting to use an NVRE function (e.g., `has_checkpoint`, `save_checkpoint`) before calling `nvre.init_resiliency_manager()`.","error":"RuntimeError: Resiliency manager not initialized. Call nvre.init_resiliency_manager() first."},{"fix":"Verify that the `checkpoint_id` matches a previously saved checkpoint. Check the configured checkpoint storage location and ensure proper permissions. If using cloud storage, confirm credentials and bucket/path are correct.","cause":"The specified checkpoint ID or path does not correspond to an existing checkpoint file or directory on the local filesystem or configured cloud storage.","error":"FileNotFoundError: [Errno 2] No such file or directory: '...' (when loading a checkpoint)"}]}