TorchFT Nightly
TorchFT (Fault Tolerance) is a PyTorch library providing fault-tolerant distributed training with automatic recovery from node failures. The nightly version (2026.4.27) tracks the latest development on the PyTorch main branch. Requires Python >= 3.8. Released daily.
pip install torchft-nightly

Common errors
error ModuleNotFoundError: No module named 'torchft'
cause Installed `torchft` instead of `torchft-nightly`, or forgot to install it.
fix Run `pip install torchft-nightly`. Note: there is no stable `torchft` package on PyPI; only the nightly is published.

error AttributeError: module 'torchft' has no attribute 'TorchftManager'
cause Direct import of `TorchftManager` from the top-level package instead of its submodule.
fix Use `from torchft.manager import TorchftManager` and ensure the subpackage is installed.

Warnings
deprecated The `torchft.elastic` module is deprecated in favor of `torchft.manager` since nightly build 2026.3.15. Use `TorchftManager` instead of `TorchftElasticAgent`.
fix Replace `from torchft.elastic import TorchftElasticAgent` with `from torchft.manager import TorchftManager`.
breaking In nightly builds after 2026.4.10, the `TorchftManager` constructor requires `store_addr` as a string; previously it accepted an optional `Store` object. This may break code using `torch.distributed.Store`.
fix Change `TorchftManager(store=my_store, ...)` to `TorchftManager(store_addr='host:port', ...)`; a TCPStore is created internally.
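The migration above can be sketched as a before/after pair. The `TorchftManager` keyword names follow the warning's description and are not executed here; the only live code is the hypothetical `format_store_addr` helper that builds the `'host:port'` string the new API expects.

```python
# Migration sketch for the store_addr breaking change (constructor
# signatures as described in the warning above; shown as comments only):
#
# Before (pre-2026.4.10 nightlies):
#   manager = TorchftManager(store=my_store, world_size=4, rank=rank)
# After:
#   manager = TorchftManager(store_addr='host:port', world_size=4, rank=rank)

# Hypothetical helper: build the 'host:port' string the new API expects.
def format_store_addr(host: str, port: int) -> str:
    return f"{host}:{port}"

print(format_store_addr("localhost", 1234))  # localhost:1234
```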
gotcha TorchFT nightly does not support CPU-only training; it requires CUDA. Running on CPU may cause silent hangs during heartbeat.
fix Use `torch.cuda.is_available()` to assert GPU availability before using TorchFT.
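A pre-flight guard that fails fast instead of risking the silent heartbeat hang might look like the sketch below. `require_cuda` is a hypothetical helper, and the `torch` import is deferred so the check degrades to a clear error when PyTorch itself is missing.

```python
# Sketch: fail fast on CPU-only hosts before constructing any TorchFT
# objects. require_cuda() is a hypothetical helper, not a torchft API.
import importlib.util

def require_cuda() -> None:
    """Raise a clear error instead of hanging silently during heartbeat."""
    if importlib.util.find_spec("torch") is None:
        raise RuntimeError("PyTorch is not installed")
    import torch  # deferred so the missing-package case errors cleanly
    if not torch.cuda.is_available():
        raise RuntimeError("TorchFT nightly requires CUDA; no GPU detected")
```

Call `require_cuda()` once at startup, before creating the manager.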
Imports
- TorchftManager
  wrong: from torchft import TorchftManager
  correct: from torchft.manager import TorchftManager
- TorchftElasticAgent (deprecated; see Warnings)
  wrong: from torchft.elastic_agent import TorchftElasticAgent
  correct: from torchft.elastic import TorchftElasticAgent
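Both pitfalls above come down to the top-level package not re-exporting submodule names. A defensive pattern, assuming the submodule path documented above, is to import from the submodule and fall back cleanly when `torchft-nightly` is not installed:

```python
# Defensive import sketch: import from the submodule (the top-level
# package does not re-export TorchftManager) and degrade gracefully
# when torchft-nightly is absent.
try:
    from torchft.manager import TorchftManager  # correct submodule import
except ImportError:
    TorchftManager = None  # torchft-nightly not installed

if TorchftManager is None:
    print("torchft missing; install with: pip install torchft-nightly")
```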
Quickstart
import os

import torch
import torch.distributed as dist
from torchft.manager import TorchftManager
# Initialize the process group (example: NCCL backend)
dist.init_process_group(backend='nccl')
# Create a TorchFT manager with fault tolerance
manager = TorchftManager(
store_addr=os.environ.get('STORE_ADDR', 'localhost:1234'),
world_size=4,
rank=dist.get_rank(),
heartbeat_interval=1.0,
)
# Wrap your model with the manager
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
for step in range(100):
    inputs = torch.randn(32, 10, device='cuda')
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    manager.commit()  # checkpoint after each step
# Cleanup
manager.shutdown()
dist.destroy_process_group()
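The commit-per-step control flow in the quickstart can be exercised without GPUs or torchft using a stub. `StubManager` below is purely hypothetical (not a torchft class); it only mirrors the shape of the loop: do a training step, then `commit()`, then `shutdown()` at the end.

```python
# Pure-Python stub mirroring the quickstart's control flow; StubManager
# is a hypothetical stand-in for TorchftManager, not a real torchft API.
class StubManager:
    def __init__(self, world_size: int, heartbeat_interval: float = 1.0):
        self.world_size = world_size
        self.heartbeat_interval = heartbeat_interval
        self.committed_steps = 0
        self.shut_down = False

    def commit(self) -> None:
        # Real TorchFT would checkpoint replicated state here.
        self.committed_steps += 1

    def shutdown(self) -> None:
        self.shut_down = True

manager = StubManager(world_size=4)
for step in range(5):
    # forward / backward / optimizer.step() would go here
    manager.commit()  # checkpoint after each step
manager.shutdown()
print(manager.committed_steps)  # 5
```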