{"library":"nvidia-resiliency-ext","title":"NVIDIA Resiliency Extension","description":"NVIDIA Resiliency Extension (NVRE) is a Python package that provides fault-tolerance features for framework developers and users, with the goal of minimizing downtime in deep learning training caused by failures and interruptions. It supports checkpointing (local and cloud), in-job restarts, and health checks. The current version is 0.6.0, with minor releases every few months that add features and fix bugs.","language":"python","status":"active","last_verified":"Mon May 18","install":{"commands":["pip install nvidia-resiliency-ext"],"cli":null},"imports":["import nvidia_resiliency_ext as nvre"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"import nvidia_resiliency_ext as nvre\n\ndef my_restart_callback(restart_args):\n    print(f\"[NVRE] Restart event received: {restart_args}\")\n\n# 1. Initialize the resiliency manager\n# This must be called early in your application's lifecycle.\n# In a real scenario, this might be within a distributed setup like Horovod or PyTorch DDP.\n# For simple testing, it can run standalone.\nprint(\"[NVRE] Initializing resiliency manager...\")\nnvre.init_resiliency_manager()\n\n# 2. Register a callback for restart events (optional, but good practice)\nnvre.register_restart_callback(my_restart_callback)\n\n# 3. Example: Checkpointing\ncheckpoint_id = \"my_training_state\"\n\nif nvre.has_checkpoint(checkpoint_id):\n    print(f\"[NVRE] Loading checkpoint '{checkpoint_id}'...\")\n    state = nvre.load_checkpoint(checkpoint_id)\n    current_step = state.get(\"step\", 0)\n    print(f\"[NVRE] Resuming from step {current_step}\")\nelse:\n    print(f\"[NVRE] No checkpoint found for '{checkpoint_id}'. 
Starting new training.\")\n    current_step = 0\n\n# Simulate some training steps\nfor i in range(current_step, current_step + 3):\n    print(f\"[NVRE] Training step {i}\")\n    # Save a checkpoint after every step for demonstration\n    state_to_save = {\"step\": i + 1, \"model_config\": {\"lr\": 0.001}}\n    print(f\"[NVRE] Saving checkpoint '{checkpoint_id}' at step {i}...\")\n    nvre.save_checkpoint(checkpoint_id, state_to_save)\n\nprint(\"[NVRE] Training finished.\")\n# Cleanup (optional in many cases, but good for explicit shutdown)\nnvre.shutdown_resiliency_manager()","lang":"python","description":"This quickstart demonstrates initializing the resiliency manager, registering a restart callback, and using checkpointing to save and load training state. The example simulates training steps with a checkpoint save after each one, and shows how a later run resumes from the saved state.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":{"tag":null,"tag_description":null,"last_tested":"2026-05-18","installed_version":"0.6.0","pypi_latest":"0.6.0","is_stale":false,"summary":{"python_range":"3.9–3.13","success_rate":30,"avg_install_s":70.3,"avg_import_s":null,"wheel_type":"wheel"},"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"nvidia-resiliency-ext","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"nvidia-resiliency-ext","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":78.6,"import_time_s":null,"mem_mb":null,"disk_size":"4.7G"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine 
(musl)","variant":"nvidia-resiliency-ext","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"nvidia-resiliency-ext","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":75.7,"import_time_s":null,"mem_mb":null,"disk_size":"4.8G"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"nvidia-resiliency-ext","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"nvidia-resiliency-ext","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":"broken","install_time_s":56.7,"import_time_s":null,"mem_mb":null,"disk_size":"4.8G"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"nvidia-resiliency-ext","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"nvidia-resiliency-ext","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":1.5,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"nvidia-resiliency-ext","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim 
(glibc)","variant":"nvidia-resiliency-ext","exit_code":1,"wheel_type":null,"failure_reason":"build_error","import_side_effects":null,"install_time_s":1.6,"import_time_s":null,"mem_mb":null,"disk_size":null}]}}