TorchFT Nightly

2026.4.27 · verified Mon Apr 27 · Python

TorchFT (Fault Tolerance) is a PyTorch library that provides fault-tolerant distributed training with automatic recovery from node failures. The nightly version (2026.4.27) tracks the latest development on the PyTorch main branch. Requires Python >=3.8. Released daily.

pip install torchft-nightly
error ModuleNotFoundError: No module named 'torchft'
cause `pip install torchft` was run instead of `pip install torchft-nightly` (the former does not exist on PyPI), or the package was never installed.
fix
Run `pip install torchft-nightly`. Note: there is no stable `torchft` package on PyPI; only the nightly is published.
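
A quick post-install sanity check (minimal sketch; it only verifies that the module is importable):

import importlib.util

# The wheel is named torchft-nightly, but the importable module is `torchft`
if importlib.util.find_spec("torchft") is None:
    raise SystemExit("torchft not found - run: pip install torchft-nightly")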
error AttributeError: module 'torchft' has no attribute 'TorchftManager'
cause `TorchftManager` was imported from the top-level package instead of its submodule.
fix
Use `from torchft.manager import TorchftManager`; the class lives in the `torchft.manager` submodule, not on the top-level package.
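
Side by side (based on the fix above):

# Incorrect: `TorchftManager` is not a top-level attribute on nightly builds
# import torchft
# manager = torchft.TorchftManager(...)

# Correct: import from the submodule
from torchft.manager import TorchftManager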
deprecated The `torchft.elastic` module has been deprecated in favor of `torchft.manager` since nightly build 2026.3.15. Use `TorchftManager` instead of `TorchftElasticAgent`.
fix Replace `from torchft.elastic import TorchftElasticAgent` with `from torchft.manager import TorchftManager`.
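
Migration sketch (constructor arguments elided; see the `store_addr` note below):

# Deprecated since nightly 2026.3.15:
# from torchft.elastic import TorchftElasticAgent
# agent = TorchftElasticAgent(...)

# Replacement:
from torchft.manager import TorchftManager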
breaking In nightly builds after 2026.4.10, the `TorchftManager` constructor requires `store_addr` as a string; previously it accepted an optional `Store` object. This may break code that passes a `torch.distributed.Store`.
fix Change `TorchftManager(store=my_store, ...)` to `TorchftManager(store_addr='host:port', ...)`; the manager constructs a `TCPStore` internally.
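
Before/after sketch (the `world_size`/`rank` arguments mirror the full example below):

from torchft.manager import TorchftManager

# Before (nightly <= 2026.4.10): a torch.distributed.Store object was accepted
# manager = TorchftManager(store=my_store, world_size=4, rank=0)

# After (nightly > 2026.4.10): pass the address as a string;
# the manager builds a TCPStore internally
manager = TorchftManager(store_addr="host:1234", world_size=4, rank=0)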
gotcha TorchFT nightly does not support CPU-only training; it requires CUDA. Running on CPU may cause silent hangs during heartbeat.
fix Use `torch.cuda.is_available()` to assert GPU availability before using TorchFT.
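
A fail-fast guard using the check named in the fix:

import torch

# Abort on CPU-only hosts instead of hanging silently in the heartbeat loop
if not torch.cuda.is_available():
    raise RuntimeError("TorchFT nightly requires CUDA; no GPU detected")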

Initialize a fault-tolerant distributed training loop using the TorchFT manager:

import os

import torch
import torch.distributed as dist
from torchft.manager import TorchftManager

# Initialize the process group (example: NCCL backend)
dist.init_process_group(backend='nccl')

# Pin each process to its own GPU (LOCAL_RANK is set by launchers like torchrun)
torch.cuda.set_device(int(os.environ.get('LOCAL_RANK', 0)))

# Create a TorchFT manager with fault tolerance
manager = TorchftManager(
    store_addr=os.environ.get('STORE_ADDR', 'localhost:1234'),
    world_size=4,
    rank=dist.get_rank(),
    heartbeat_interval=1.0,
)

# Create a model and optimizer (model on the GPU, per the CUDA requirement above)
model = torch.nn.Linear(10, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    inputs = torch.randn(32, 10, device='cuda')
    outputs = model(inputs)
    loss = outputs.sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    manager.commit()  # checkpoint after each step

# Cleanup
manager.shutdown()
dist.destroy_process_group()
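
To run this example across 4 local GPUs, launch it with `torchrun --nproc_per_node=4 train.py` (script name illustrative) so that `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, and the rendezvous variables expected by `init_process_group` are set.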