NVIDIA Megatron Core
Megatron Core is a Python library from NVIDIA for building highly efficient, scalable transformer-based models, with a focus on large-scale distributed training. It provides the fundamental building blocks for tensor and pipeline parallelism. At the time of writing the current version is 0.16.1; minor versions are released frequently.
Common errors
- RuntimeError: CUDA error: no CUDA-capable device is detected
  Cause: The system lacks a compatible NVIDIA GPU, or the CUDA drivers are not correctly installed/configured for PyTorch to detect them.
  Fix: Verify NVIDIA GPU presence, CUDA driver installation, and PyTorch's CUDA compatibility (`torch.cuda.is_available()`, `torch.version.cuda`). Ensure necessary environment variables such as `LD_LIBRARY_PATH` are set.
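The check above can be scripted. A minimal sketch that reports what PyTorch sees, written so it degrades gracefully on a machine where PyTorch is not even installed:

```python
from typing import Optional, Tuple

def cuda_status() -> Tuple[bool, bool, Optional[str]]:
    """Return (torch_installed, cuda_available, cuda_build_version).

    Never raises, so it can run anywhere as a first diagnostic step.
    """
    try:
        import torch
    except ImportError:
        return (False, False, None)
    # torch.version.cuda is None for CPU-only builds of PyTorch.
    return (True, torch.cuda.is_available(), torch.version.cuda)

installed, available, cuda_build = cuda_status()
print(f"torch installed: {installed}, CUDA available: {available}, built for CUDA: {cuda_build}")
```

If `cuda_build` is set but `available` is False, the drivers (not the PyTorch build) are usually the problem.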
- ValueError: tensor_model_parallel_world_size is not set
  Cause: Megatron Core's model-parallel state (tensor/pipeline parallel groups) has not been initialized, and its parallel layers require it.
  Fix: After `torch.distributed.init_process_group()` has succeeded, initialize the parallel state with `megatron.core.parallel_state.initialize_model_parallel()`; in current releases this functionality lives in `megatron.core.parallel_state`, not a `dist_init` module.
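The required ordering (process group first, then Megatron Core's parallel state) can be sketched as a small guard function. `init_parallel_state` is a hypothetical helper, not part of the library; it assumes `megatron-core` is installed and returns a status string instead of raising so the ordering itself stays visible:

```python
def init_parallel_state(tensor_parallel_size: int = 1) -> str:
    """Initialize Megatron Core's parallel state, but only after torch.distributed is up."""
    try:
        import torch.distributed as dist
        from megatron.core import parallel_state
    except ImportError as exc:
        return f"missing dependency: {exc.name}"
    if not dist.is_initialized():
        # Step 1 must come first: torch.distributed.init_process_group(...)
        return "call torch.distributed.init_process_group() first"
    # Step 2: carve the world into tensor/pipeline model-parallel groups.
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=tensor_parallel_size)
    return "parallel state initialized"

print(init_parallel_state())
```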
- torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1678278270412/work/torch/csrc/distributed/c10d/NCCLUtils.cpp:218, unhandled system error (aborting at rank 0)
  Cause: NCCL (NVIDIA Collective Communications Library) initialization failed, often due to network issues, firewall restrictions, an incorrect `MASTER_ADDR`/`MASTER_PORT`, or a number of launched processes that does not match `WORLD_SIZE`.
  Fix: Check network connectivity between nodes, verify that the `MASTER_ADDR` and `MASTER_PORT` environment variables are correctly set and reachable, and ensure the number of launched processes matches `WORLD_SIZE`.
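A quick sanity check on the rendezvous configuration before launching. This stdlib-only sketch checks the four standard `torch.distributed` environment variables; `check_rendezvous_env` is an illustrative helper, not a library function:

```python
import os
from typing import List

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")

def check_rendezvous_env(env=os.environ) -> List[str]:
    """Return a list of problems with the torch.distributed rendezvous settings."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if name not in env]
    rank, world = env.get("RANK"), env.get("WORLD_SIZE")
    if rank and world and rank.isdigit() and world.isdigit():
        # Ranks are zero-based, so every rank must be below the world size.
        if int(rank) >= int(world):
            problems.append("RANK must be strictly less than WORLD_SIZE")
    return problems

for problem in check_rendezvous_env():
    print(f"rendezvous problem: {problem}")
```

Launchers such as `torchrun` set all four variables for you; this check is mainly useful when processes are started by hand.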
- ImportError: cannot import name 'TransformerBlock' from 'megatron.core.transformer'
  Cause: The class lives in a submodule (in recent versions, `megatron.core.transformer.transformer_block`), or your `megatron-core` version is old or incompatible with the code you are running.
  Fix: Consult the official Megatron-LM GitHub repository or documentation for the correct import path for your `megatron-core` version, and upgrade with `pip install --upgrade megatron-core` if your installation is stale.
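To rule out a stale install before chasing import paths, the installed distribution version can be read with the standard library alone (no Megatron import needed):

```python
from importlib import metadata
from typing import Optional

def installed_version(dist_name: str = "megatron-core") -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

version = installed_version()
print(f"megatron-core: {version or 'not installed'}")
```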
Warnings
- breaking Megatron Core v0.15.0 introduced a 'new TE interface for user buffers'. Custom integrations or extensions that directly interfaced with lower-level buffers might require updates to conform to the new API.
- gotcha Megatron Core is fundamentally designed for distributed GPU training. Running without a proper PyTorch distributed setup (e.g., `torch.distributed.init_process_group`) and available CUDA devices will lead to errors or severely limited functionality.
- breaking Megatron Core v0.14.0 added 'async support for DynamicInferenceEngine'. If you were using the inference engine in prior versions, this change might alter its behavior and require adapting existing inference pipelines.
- gotcha Megatron Core's performance heavily depends on specialized NVIDIA kernels (e.g., via Transformer Engine). Missing or incorrectly installed dependencies related to CUDA, cuDNN, or specialized libraries can lead to performance degradation or runtime errors.
Install
- pip install megatron-core
- Some features (e.g. fused kernels via NVIDIA Transformer Engine) require additional GPU packages; check the Megatron-LM README for the optional dependencies supported by your version.
Imports
- ColumnParallelLinear
from megatron.core.tensor_parallel.layers import ColumnParallelLinear
- RowParallelLinear
from megatron.core.tensor_parallel.layers import RowParallelLinear
- TransformerBlock
from megatron.core.transformer.transformer_block import TransformerBlock
- TransformerLayer
from megatron.core.transformer.transformer_layer import TransformerLayer
- parallel_state
import megatron.core.parallel_state
from megatron.core import parallel_state
Quickstart
import os
import torch
import torch.distributed as dist
from megatron.core import parallel_state
from megatron.core.model_parallel_config import ModelParallelConfig
from megatron.core.tensor_parallel.layers import ColumnParallelLinear

# Minimal distributed setup for demonstration purposes.
# In a real scenario these env vars are set by a launcher (e.g. torchrun)
# and dist.init_process_group is called once, globally.
if not dist.is_initialized():
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '29500')
    os.environ.setdefault('RANK', '0')
    # WORLD_SIZE=1 allows a single-GPU test without a full distributed launch.
    os.environ.setdefault('WORLD_SIZE', '1')
    if torch.cuda.is_available():
        try:
            dist.init_process_group(
                backend='nccl',
                rank=int(os.environ['RANK']),
                world_size=int(os.environ['WORLD_SIZE']),
            )
            print("PyTorch distributed group initialized with NCCL.")
        except Exception as e:
            print(f"Warning: could not initialize NCCL backend: {e}. "
                  "Falling back to a non-distributed run.")
            if dist.is_initialized():  # destroy a partially created group
                dist.destroy_process_group()
    else:
        print("Warning: CUDA not available. Skipping torch.distributed init.")

# Megatron-Core keeps its own model-parallel state, which its parallel
# layers read at construction time, so it must be initialized first.
if dist.is_initialized():
    parallel_state.initialize_model_parallel(
        tensor_model_parallel_size=int(os.environ.get('WORLD_SIZE', '1')))
else:
    # Best-effort fallback for a single-process, CPU-only demo: override
    # the recorded world size/rank instead of creating process groups.
    parallel_state.set_tensor_model_parallel_world_size(1)
    parallel_state.set_tensor_model_parallel_rank(0)

# Define a simple tensor-parallel linear layer.
hidden_size = 128
output_size = 256
try:
    # ColumnParallelLinear splits the weight matrix column-wise: with a
    # tensor-parallel world size > 1 each rank computes a slice of the
    # output, and gather_output=True all-gathers the slices so every
    # rank sees the full output.
    config = ModelParallelConfig(
        use_cpu_initialization=not torch.cuda.is_available())
    linear_layer = ColumnParallelLinear(
        input_size=hidden_size,
        output_size=output_size,
        config=config,
        init_method=torch.nn.init.xavier_normal_,
        gather_output=True,
    )
    if torch.cuda.is_available():
        linear_layer.cuda()
    # Dummy input; the last dimension must match hidden_size.
    input_tensor = torch.randn(2, 4, hidden_size)
    if torch.cuda.is_available():
        input_tensor = input_tensor.cuda()
    # Megatron-Core linear layers return (output, bias); bias is None
    # here because it has already been added to the output.
    output_tensor, _ = linear_layer(input_tensor)
    print("\nMegatron-Core ColumnParallelLinear initialized successfully.")
    print(f"Input shape: {input_tensor.shape}")
    print(f"Output shape (gathered): {output_tensor.shape}")
    print(f"Output device: {output_tensor.device}")
except Exception as e:
    print(f"An error occurred during Megatron-Core layer execution: {e}")
finally:
    # Clean up the distributed process group if one was created.
    if dist.is_initialized():
        dist.destroy_process_group()