NVIDIA Resiliency Extension
NVIDIA Resiliency Extension (NVRx) is a Python package that provides fault-tolerance features for framework developers and users, aiming to minimize downtime in deep learning training caused by failures and interruptions. It supports checkpointing (local and cloud), in-job restarts, and health checks. The current version is 0.5.0; minor releases arrive every few months with new features and bug fixes.
Resources
API endpoints
full doc /v1/registry/nvidia-resiliency-ext
compatibility /v1/registry/nvidia-resiliency-ext/compatibility
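The endpoint paths above are relative, so a client has to join them to a host. The following is a minimal sketch of that step, not the registry's documented client: the host `registry.example.com` is a placeholder (the page does not state one), the helper names `endpoint_url` and `fetch_json` are hypothetical, and the assumption that responses are JSON is unverified.

```python
import json
import urllib.request
from urllib.parse import urljoin

# Placeholder host (assumption): the page lists only relative paths.
BASE_URL = "https://registry.example.com"

# Paths exactly as listed under "API endpoints".
ENDPOINTS = {
    "full_doc": "/v1/registry/nvidia-resiliency-ext",
    "compatibility": "/v1/registry/nvidia-resiliency-ext/compatibility",
}


def endpoint_url(name: str, base_url: str = BASE_URL) -> str:
    """Return the absolute URL for one of the listed endpoints."""
    return urljoin(base_url, ENDPOINTS[name])


def fetch_json(name: str, base_url: str = BASE_URL, timeout: float = 10.0):
    """Fetch an endpoint and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(endpoint_url(name, base_url), timeout=timeout) as resp:
        return json.load(resp)
```

With the real host substituted for the placeholder, `fetch_json("compatibility")` would request the compatibility endpoint; `endpoint_url` can be used on its own to check the constructed address before making any network call.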