TorchFT Nightly

2026.4.27 · verified Mon Apr 27 · Python

TorchFT (Fault Tolerance) is a PyTorch library that provides fault-tolerant distributed training with automatic recovery from node failures. The nightly version (2026.4.27) tracks the latest development on the PyTorch main branch. Requires Python >=3.8. Released daily.

pip install torchft-nightly
error ModuleNotFoundError: No module named 'torchft'
cause `pip install torchft` was run instead of `pip install torchft-nightly` (the former does not exist on PyPI), or the package was never installed.
fix
Run `pip install torchft-nightly`. Note: there is no stable `torchft` package on PyPI; only the nightly is published.
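
A quick post-install sanity check (minimal sketch; it only verifies that the module is importable):

import importlib.util

# The wheel is named torchft-nightly, but the importable module is `torchft`
if importlib.util.find_spec("torchft") is None:
    raise SystemExit("torchft not found - run: pip install torchft-nightly")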
error AttributeError: module 'torchft' has no attribute 'TorchftManager'
cause `TorchftManager` was imported from the top-level package instead of its submodule.
fix
Use `from torchft.manager import TorchftManager`; the class lives in the `torchft.manager` submodule, not on the top-level package.
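
Side by side (based on the fix above):

# Incorrect: `TorchftManager` is not a top-level attribute on nightly builds
# import torchft
# manager = torchft.TorchftManager(...)

# Correct: import from the submodule
from torchft.manager import TorchftManager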
deprecated The `torchft.elastic` module has been deprecated in favor of `torchft.manager` since nightly build 2026.3.15. Use `TorchftManager` instead of `TorchftElasticAgent`.
fix Replace `from torchft.elastic import TorchftElasticAgent` with `from torchft.manager import TorchftManager`.
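
Migration sketch (constructor arguments elided; see the `store_addr` note below):

# Deprecated since nightly 2026.3.15:
# from torchft.elastic import TorchftElasticAgent
# agent = TorchftElasticAgent(...)

# Replacement:
from torchft.manager import TorchftManager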
breaking In nightly builds after 2026.4.10, the `TorchftManager` constructor requires `store_addr` as a string; previously it accepted an optional `Store` object. This may break code that passes a `torch.distributed.Store`.
fix Change `TorchftManager(store=my_store, ...)` to `TorchftManager(store_addr='host:port', ...)`; the manager constructs a `TCPStore` internally.
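
Before/after sketch (the `world_size`/`rank` arguments mirror the full example below):

from torchft.manager import TorchftManager

# Before (nightly <= 2026.4.10): a torch.distributed.Store object was accepted
# manager = TorchftManager(store=my_store, world_size=4, rank=0)

# After (nightly > 2026.4.10): pass the address as a string;
# the manager builds a TCPStore internally
manager = TorchftManager(store_addr="host:1234", world_size=4, rank=0)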
gotcha TorchFT nightly does not support CPU-only training; it requires CUDA. Running on CPU may cause silent hangs during heartbeat.
fix Use `torch.cuda.is_available()` to assert GPU availability before using TorchFT.
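
A fail-fast guard using the check named in the fix:

import torch

# Abort on CPU-only hosts instead of hanging silently in the heartbeat loop
if not torch.cuda.is_available():
    raise RuntimeError("TorchFT nightly requires CUDA; no GPU detected")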

Initialize a fault-tolerant distributed training loop using the TorchFT manager:

import os

import torch
import torch.distributed as dist
from torchft.manager import TorchftManager

# Initialize the process group (example: NCCL backend)
dist.init_process_group(backend='nccl')

# Pin each process to its own GPU (LOCAL_RANK is set by launchers like torchrun)
torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))

# Create a TorchFT manager with fault tolerance
manager = TorchftManager(
    store_addr=os.environ.get('STORE_ADDR', 'localhost:1234'),
    world_size=4,
    rank=dist.get_rank(),
    heartbeat_interval=1.0,
)

# Create a model and optimizer (model on the GPU, per the CUDA requirement above)
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    inputs = torch.randn(32, 10, device='cuda')
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    manager.commit()  # checkpoint after each step

# Cleanup
manager.shutdown()
dist.destroy_process_group()
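
To run this example across 4 local GPUs, launch it with `torchrun --nproc_per_node=4 train.py` (script name illustrative) so that `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, and the rendezvous variables expected by `init_process_group` are set.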