Accelerate
Hugging Face library for running PyTorch training across any distributed configuration with minimal code changes. Current version: 1.13.0 (Mar 2026). Requires Python >=3.10. Core pattern: Accelerator() + accelerator.prepare() + accelerator.backward(). accelerate config must be run before first use.
Warnings
- breaking accelerate config must be run before first use. Without a config file, Accelerate silently falls back to single-process CPU mode, so multi-GPU training simply won't use multiple GPUs.
- breaking Python 3.9 support dropped in 1.13.0. Accelerate now requires Python >=3.10.
- breaking Initializing Accelerator() outside the training function raises ValueError when launching multi-GPU training via notebook_launcher. Without notebook_launcher, notebook training silently runs on a single GPU with no error.
- breaking accelerator.load_state() fails with PyTorch 2.6+ because torch.load now defaults to weights_only=True. Optimizer states containing custom objects (omegaconf.ListConfig, etc.) raise UnpicklingError; see the workaround sketch after this list.
- breaking DeepSpeed integration: only one nn.Module per Accelerator instance is supported. Passing multiple models to accelerator.prepare() with DeepSpeed raises AssertionError.
- gotcha accelerate launch consumes any flags it recognizes that appear before the script path. Flags intended for the script must come after the script path (use -- as a separator when a flag name collides with a launcher flag); otherwise they are parsed as accelerate launch flags.
- gotcha loss.backward() instead of accelerator.backward(loss) silently bypasses mixed-precision gradient scaling. Training proceeds, but gradients are wrong under fp16/bf16, leading to numerical instability or NaN loss.
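A workaround for the load_state() failure under PyTorch 2.6+ is to allowlist the offending classes before loading. A minimal sketch, assuming the checkpoint contains omegaconf objects (substitute whatever classes yours actually holds); torch.serialization.add_safe_globals is the PyTorch API for extending the weights_only allowlist, and 'ckpt/' is a hypothetical checkpoint directory:
import torch.serialization
from omegaconf import DictConfig, ListConfig

# Allowlist the custom classes stored in the optimizer state so that
# torch.load(..., weights_only=True) can unpickle them.
torch.serialization.add_safe_globals([DictConfig, ListConfig])
accelerator.load_state('ckpt/')  # hypothetical checkpoint dir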
Install
- pip install accelerate
- accelerate config
- python -c "from accelerate.utils import write_basic_config; write_basic_config(mixed_precision='fp16')"
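To check the active configuration and launch, the usual invocation keeps launcher flags before the script path and script flags after it (train.py and --lr are illustrative):
accelerate env  # print the current environment and config
accelerate launch --num_processes 2 train.py --lr 1e-4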
Imports
- Accelerator
from accelerate import Accelerator

def training_function():
    # Accelerator MUST be initialized inside the training function for notebook_launcher
    accelerator = Accelerator(mixed_precision='fp16')
    model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        accelerator.backward(loss)  # NOT loss.backward()
        optimizer.step()
- accelerator.backward
loss = criterion(outputs, targets)
accelerator.backward(loss)
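To actually start multi-GPU training from a notebook, pass the function to notebook_launcher. A minimal sketch; num_processes=2 is an assumption about the available GPUs:
from accelerate import notebook_launcher

# Spawns num_processes workers, each running training_function().
# Accelerator() must be created inside training_function (see Warnings).
notebook_launcher(training_function, args=(), num_processes=2)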
Quickstart
from accelerate import Accelerator
import torch
import torch.nn as nn
def train():
    accelerator = Accelerator(mixed_precision='bf16')
    model = nn.Linear(10, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    dataloader = ...  # your DataLoader
    # prepare() handles device placement and distributed wrapping
    model, optimizer, dataloader = accelerator.prepare(
        model, optimizer, dataloader
    )
    model.train()
    for batch in dataloader:
        optimizer.zero_grad()
        outputs = model(batch['input'])
        loss = nn.functional.mse_loss(outputs, batch['target'])
        accelerator.backward(loss)  # not loss.backward()
        optimizer.step()
    # Save on main process only
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        accelerator.save_model(model, 'output/')
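For resumable training, save_state()/load_state() checkpoint the model, optimizer, and RNG state together. A minimal sketch; 'ckpt/' is a hypothetical directory, and the PyTorch 2.6+ warning above applies to load_state():
# Checkpoint everything registered with the Accelerator.
accelerator.save_state('ckpt/')
# Later, after rebuilding and prepare()-ing the same objects:
accelerator.load_state('ckpt/')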