{"id":3956,"library":"deepspeed","title":"DeepSpeed","description":"DeepSpeed is a deep learning optimization library for PyTorch, developed by Microsoft, that significantly reduces computing resources required for training and inference of large-scale models. It provides techniques such as ZeRO (Zero Redundancy Optimizer) for memory optimization, DeepSpeed-MoE for Mixture of Experts, and high-performance inference. Currently at version 0.18.9, it maintains an active development pace with frequent patch releases addressing bug fixes, performance enhancements, and new feature integrations, often every few weeks.","status":"active","version":"0.18.9","language":"en","source_language":"en","source_url":"https://github.com/deepspeedai/DeepSpeed","tags":["deep-learning","pytorch","distributed-training","llm-training","gpu","optimization","hpc"],"install":[{"cmd":"pip install deepspeed","lang":"bash","label":"Base installation"},{"cmd":"DS_BUILD_OPS=1 pip install deepspeed --global-option=\"--cuda_ext\" --global-option=\"--multi_tensor_adam\" --global-option=\"--fused_lamb\" --global-option=\"--sparse_attn\" --global-option=\"--fp16\"","lang":"bash","label":"With all custom C++/CUDA extensions (recommended for performance)"}],"dependencies":[{"reason":"DeepSpeed is built on top of PyTorch and requires a compatible version for GPU operations.","package":"torch","optional":false}],"imports":[{"note":"This is the primary entry point for setting up DeepSpeed. 
It returns the DeepSpeed engine, optimizer, dataloader, and LR scheduler.","symbol":"deepspeed.initialize","correct":"import deepspeed\nengine, optimizer, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config)"},{"note":"DeepSpeedEngine is the core runtime class, but it is normally created for you by `deepspeed.initialize` rather than instantiated directly.","symbol":"deepspeed.DeepSpeedEngine","correct":"from deepspeed.runtime.engine import DeepSpeedEngine"},{"note":"An optimized Adam implementation for CPU offloading, often imported directly when needed.","symbol":"deepspeed.ops.adam.DeepSpeedCPUAdam","correct":"from deepspeed.ops.adam import DeepSpeedCPUAdam"}],"quickstart":{"code":"import torch\nimport torch.nn as nn\nimport deepspeed\nimport json\nimport os\n\n# 1. Define a simple PyTorch model\nclass SimpleModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.linear = nn.Linear(10, 1)\n\n    def forward(self, x):\n        return self.linear(x)\n\n# 2. Create a DeepSpeed config (a Python dict, so booleans are True/False, not JSON true/false)\nds_config = {\n    \"train_batch_size\": 2,\n    \"gradient_accumulation_steps\": 1,\n    \"optimizer\": {\n        \"type\": \"AdamW\",\n        \"params\": {\n            \"lr\": 1e-5,\n            \"betas\": [0.8, 0.999],\n            \"eps\": 1e-8,\n            \"weight_decay\": 0.01\n        }\n    },\n    \"fp16\": {\n        # fp16 requires CUDA; fall back to fp32 on CPU-only machines\n        \"enabled\": torch.cuda.is_available(),\n        \"initial_scale_power\": 16\n    },\n    \"zero_optimization\": {\n        \"stage\": 1\n    },\n    # steps_per_print is a top-level config key, not nested under \"logging\"\n    \"steps_per_print\": 2\n}\n\n# Save the config to a file for use with the deepspeed launcher, if desired\nconfig_path = \"ds_config.json\"\nwith open(config_path, \"w\") as f:\n    json.dump(ds_config, f, indent=4)\n\n# 3. 
Initialize DeepSpeed\ndef main():\n    # DeepSpeed normally initializes torch.distributed itself when the script is\n    # launched via the `deepspeed` command; this fallback lets the script also be\n    # run directly (e.g., `python quickstart_script.py`) as a single process.\n    if not torch.distributed.is_initialized():\n        try:\n            # The defaults below suit a single local process; the deepspeed\n            # launcher sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for real runs.\n            rank = int(os.environ.get('RANK', '0'))\n            world_size = int(os.environ.get('WORLD_SIZE', '1'))\n            master_addr = os.environ.get('MASTER_ADDR', 'localhost')\n            master_port = os.environ.get('MASTER_PORT', '29500')\n            torch.distributed.init_process_group(\n                backend=\"nccl\" if torch.cuda.is_available() else \"gloo\",\n                rank=rank, world_size=world_size,\n                init_method=f\"tcp://{master_addr}:{master_port}\")\n        except RuntimeError as e:\n            print(f\"Warning: failed to initialize the torch.distributed process group directly. Error: {e}\")\n            print(\"This is expected if not launched by the DeepSpeed runner. 
Proceeding with potential issues.\")\n            # Without a distributed setup DeepSpeed may not function correctly;\n            # we continue here purely for demonstration.\n\n    model = SimpleModel()\n\n    # Let the config's \"optimizer\" section build the AdamW optimizer; passing a\n    # client optimizer object as well would conflict with it. deepspeed.initialize\n    # also moves the model to the correct device.\n    model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(\n        model=model,\n        model_parameters=model.parameters(),\n        config=ds_config\n    )\n\n    # 4. Dummy data and training steps; match the engine's device and parameter\n    # dtype (half precision when fp16 is enabled)\n    param = next(model_engine.parameters())\n    input_data = torch.randn(2, 10).to(device=param.device, dtype=param.dtype)\n    labels = torch.randn(2, 1).to(device=param.device, dtype=param.dtype)\n\n    for i in range(3):\n        output = model_engine(input_data)\n        loss = nn.MSELoss()(output, labels)\n        model_engine.backward(loss)\n        model_engine.step()\n        if torch.distributed.is_initialized():\n            print(f\"Rank {torch.distributed.get_rank()} - Step {i}, Loss: {loss.item():.4f}\")\n        else:\n            print(f\"Step {i}, Loss: {loss.item():.4f}\")\n\n    # Clean up the config file\n    os.remove(config_path)\n\nif __name__ == '__main__':\n    # DeepSpeed scripts are typically run with the `deepspeed` launcher; main()\n    # attempts a basic direct run, but full (multi-GPU) functionality requires\n    # the launcher.\n    try:\n        main()\n    except Exception as e:\n        print(f\"An error occurred: {e}\")\n        print(\"Hint: DeepSpeed usually requires the `deepspeed` launcher for proper distributed setup.\")\n        print(\"Try running with: `deepspeed --num_gpus=1 
quickstart_script.py`\")\n\n","lang":"python","description":"This quickstart demonstrates a basic DeepSpeed setup for a simple PyTorch model. It defines a model, builds a DeepSpeed configuration dict (also saved as `ds_config.json` for optional launcher use), initializes DeepSpeed, and performs a few training steps. To run the script correctly and leverage DeepSpeed's distributed features (especially multi-GPU), use the DeepSpeed launcher, e.g., `deepspeed --num_gpus=1 quickstart_script.py`; since the config dict is passed to `deepspeed.initialize` directly, no `--deepspeed_config` argument is needed. The script includes a `torch.distributed.init_process_group` fallback for basic local testing, but the launcher handles distributed setup automatically."},"warnings":[{"fix":"Always launch DeepSpeed training scripts with the `deepspeed` command-line utility (e.g., `deepspeed --num_gpus=1 your_script.py --deepspeed_config ds_config.json`) or with `torchrun` for multi-node setups.","message":"DeepSpeed requires specific execution commands for distributed training (and even for single-GPU use of its features). Running a DeepSpeed-enabled script directly with `python your_script.py` will often fail because `torch.distributed` is not initialized correctly.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure a CUDA-enabled GPU and compatible NVIDIA drivers are installed. For CPU-only or CPU-offload scenarios, consult DeepSpeed's documentation on CPU offloading and ZeRO-Offload configurations.","message":"DeepSpeed is fundamentally designed for GPU acceleration and relies heavily on CUDA. Attempting to run on CPU without specific CPU-offloading configurations will either fail due to missing CUDA operations or result in extremely slow execution. 
Some features, like ZeRO-Offload, can leverage the CPU, but they still typically require a GPU for model computations.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Refer to the official DeepSpeed documentation for comprehensive examples and explanations of configuration parameters. Start with simple configurations and add complexity gradually. Leverage DeepSpeed's logging options (e.g., `wall_clock_breakdown`) and the `deepspeed.runtime.utils.see_memory_usage` helper to debug memory and performance issues.","message":"The DeepSpeed configuration (often specified via a `ds_config.json` file) is complex and critical for controlling features like ZeRO optimization, FP16 training, optimizer choice, and more. Incorrect or incomplete configurations are a very common source of errors and unexpected behavior.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always check the DeepSpeed GitHub repository or documentation for recommended PyTorch versions. If you hit build or runtime errors after a PyTorch upgrade, downgrade PyTorch to a version known to be compatible with your DeepSpeed version, or upgrade DeepSpeed to its latest patch release.","message":"Compatibility with PyTorch versions can be strict. DeepSpeed often requires specific PyTorch versions for stability and to exploit the latest features and optimizations, especially in its custom CUDA kernels. Upgrading PyTorch independently of DeepSpeed can lead to build failures or runtime errors.","severity":"breaking","affected_versions":"0.17.x, 0.18.x (patch releases can introduce or fix compatibility issues)"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}