DeepSpeed
DeepSpeed is a deep learning optimization library for PyTorch, developed by Microsoft, that significantly reduces the computing resources required to train and serve large-scale models. It provides techniques such as ZeRO (Zero Redundancy Optimizer) for memory optimization, DeepSpeed-MoE for Mixture-of-Experts models, and a high-performance inference engine. Currently at version 0.18.9, it is actively developed, with patch releases every few weeks covering bug fixes, performance improvements, and new features.
Warnings
- gotcha DeepSpeed requires specific execution commands for distributed training (or even single-GPU use with its features). Running a DeepSpeed-enabled script directly with `python your_script.py` will often result in errors because `torch.distributed` is not initialized correctly.
- gotcha DeepSpeed is fundamentally designed for GPU acceleration and relies heavily on CUDA. Attempting to run on CPU without specific CPU offloading configurations will either fail due to missing CUDA operations or result in extremely slow execution. Some features like ZeRO-Offload can leverage CPU, but still typically require a GPU for model computations.
- gotcha The DeepSpeed configuration (often specified via a `ds_config.json` file) is complex and critical for controlling features like ZeRO optimization, FP16 training, optimizer choice, and more. Incorrect or incomplete configurations are a very common source of errors and unexpected behavior.
- breaking Compatibility with PyTorch versions can be strict. DeepSpeed often requires specific PyTorch versions to ensure stability and exploit the latest features/optimizations, especially for its custom CUDA kernels. Upgrading PyTorch independently of DeepSpeed can lead to build failures or runtime errors.
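The launcher gotcha above in practice: a DeepSpeed script is started through the `deepspeed` CLI rather than plain `python`. The script name `train.py` is a placeholder, and passing `--deepspeed_config` assumes the script parses it (e.g. via `deepspeed.add_config_arguments`); script arguments always come after the script name.

```shell
# Wrong: bypasses the launcher, so torch.distributed is not set up as DeepSpeed expects
# python train.py

# Single node, one GPU
deepspeed --num_gpus=1 train.py --deepspeed_config ds_config.json

# Single node, all visible GPUs
deepspeed train.py --deepspeed_config ds_config.json

# Diagnose the installation: shows the PyTorch/CUDA versions DeepSpeed was built
# against and which ops are compiled/compatible (useful for the version gotcha above)
ds_report
```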
Install
- pip install deepspeed
- DS_BUILD_OPS=1 pip install deepspeed (pre-compiles all C++/CUDA ops at install time; requires a matching CUDA toolkit and compiler. Individual ops can be pre-built instead via environment flags, e.g. DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 pip install deepspeed)
Imports
- deepspeed.initialize
import deepspeed
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config)  # `config` supersedes the legacy `config_params` argument
- deepspeed.DeepSpeedEngine
from deepspeed.runtime.engine import DeepSpeedEngine
- deepspeed.ops.adam.DeepSpeedCPUAdam
from deepspeed.ops.adam import DeepSpeedCPUAdam
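DeepSpeedCPUAdam is the optimizer DeepSpeed pairs with ZeRO-Offload; in practice it is selected through the config rather than constructed by hand. A minimal sketch of such a config follows (field names per the DeepSpeed config schema; the values and file name are illustrative):

```python
import json

# ZeRO stage 2 with optimizer state offloaded to CPU; with "offload_optimizer"
# set, DeepSpeed uses DeepSpeedCPUAdam as the underlying optimizer.
offload_config = {
    "train_batch_size": 2,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}

with open("ds_offload_config.json", "w") as f:
    json.dump(offload_config, f, indent=4)
```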
Quickstart
import torch
import torch.nn as nn
import deepspeed
import json
import os
# 1. Define a simple PyTorch model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)
# 2. Create a DeepSpeed config (note: `steps_per_print` is a top-level key,
# and fp16 training requires a CUDA GPU)
ds_config = {
    "train_batch_size": 2,
    "gradient_accumulation_steps": 1,
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": 1e-5,
            "betas": [0.8, 0.999],
            "eps": 1e-8,
            "weight_decay": 0.01
        }
    },
    "fp16": {
        "enabled": torch.cuda.is_available(),
        "initial_scale_power": 16
    },
    "zero_optimization": {
        "stage": 1
    },
    "steps_per_print": 2
}

# Save the config so it could also be passed to the `deepspeed` launcher
config_path = "ds_config.json"
with open(config_path, "w") as f:
    json.dump(ds_config, f, indent=4)
# 3. Initialize DeepSpeed
def main():
    # When launched via the `deepspeed` command, torch.distributed is set up for you.
    # This fallback lets the script also run directly as a single process; it uses
    # sane defaults (rank 0, world size 1) when the usual env vars are absent.
    if not torch.distributed.is_initialized():
        try:
            rank = int(os.environ.get("RANK", "0"))
            world_size = int(os.environ.get("WORLD_SIZE", "1"))
            master_addr = os.environ.get("MASTER_ADDR", "localhost")
            master_port = os.environ.get("MASTER_PORT", "29500")
            torch.distributed.init_process_group(
                backend="nccl" if torch.cuda.is_available() else "gloo",
                rank=rank,
                world_size=world_size,
                init_method=f"tcp://{master_addr}:{master_port}",
            )
        except RuntimeError as e:
            print(f"Warning: could not initialize torch.distributed directly: {e}")
            print("This is expected when not launched by the deepspeed runner.")

    model = SimpleModel()
    # Move model to CUDA if available for DeepSpeed operations
    if torch.cuda.is_available():
        model.cuda()

    # The optimizer is built by DeepSpeed from the "optimizer" section of
    # ds_config; passing a hand-built client optimizer as well would be redundant.
    model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )
    # 4. Dummy data and training step
    input_data = torch.randn(2, 10)
    labels = torch.randn(2, 1)
    if torch.cuda.is_available():
        input_data = input_data.cuda()
        labels = labels.cuda()
    if model_engine.fp16_enabled():
        # An fp16 engine holds half-precision weights, so inputs must match
        input_data = input_data.half()
        labels = labels.half()

    for i in range(3):
        output = model_engine(input_data)
        loss = nn.MSELoss()(output, labels)
        model_engine.backward(loss)
        model_engine.step()
        if torch.distributed.is_initialized():
            print(f"Rank {torch.distributed.get_rank()} - Step {i}, Loss: {loss.item():.4f}")
        else:
            print(f"Step {i}, Loss: {loss.item():.4f}")

    # Clean up the config file
    os.remove(config_path)
if __name__ == '__main__':
    # Typically run with the launcher: `deepspeed --num_gpus=1 quickstart_script.py`
    # (the config is passed programmatically via `config=ds_config` above, so no
    # --deepspeed_config argument is needed here).
    try:
        main()
    except Exception as e:
        print(f"An error occurred: {e}")
        print("Hint: DeepSpeed usually requires the `deepspeed` launcher for proper distributed setup.")
        print("Try running with: `deepspeed --num_gpus=1 quickstart_script.py`")
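A frequent source of config errors is the batch-size invariant DeepSpeed enforces: `train_batch_size` must equal `train_micro_batch_size_per_gpu × gradient_accumulation_steps × world_size`. A quick sanity check of the quickstart's numbers (the helper function is illustrative, not part of DeepSpeed):

```python
def effective_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, world_size: int) -> int:
    """Global train batch size as DeepSpeed computes it."""
    return micro_batch_per_gpu * grad_accum_steps * world_size

# The quickstart runs one process (world_size=1) with
# gradient_accumulation_steps=1, so a micro-batch of 2 gives train_batch_size=2.
assert effective_batch_size(2, 1, 1) == 2
```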