{"id":7403,"library":"megatron-core","title":"NVIDIA Megatron Core","description":"Megatron Core is a Python library developed by NVIDIA for building highly efficient and scalable transformer-based models, especially for large-scale distributed training. It provides fundamental building blocks for tensor and pipeline parallelism. The current version is 0.16.1, and new minor versions are released frequently.","status":"active","version":"0.16.1","language":"en","source_language":"en","source_url":"https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core","tags":["AI","LLM","Transformers","Distributed Training","Deep Learning","NVIDIA","GPU"],"install":[{"cmd":"pip install megatron-core","lang":"bash","label":"Install latest version"},{"cmd":"pip install 'megatron-core[cuda]' # For full CUDA/cuDNN integration if needed beyond base","lang":"bash","label":"Install with CUDA extras (optional)"}],"dependencies":[{"reason":"Megatron Core relies heavily on PyTorch for its deep learning functionality and distributed primitives (torch.distributed).","package":"torch","optional":false},{"reason":"Many performance-critical operations depend on NVIDIA libraries, especially if not using a pre-packaged PyTorch build with CUDA support.","package":"nvidia-cublas-cu12","optional":true}],"imports":[{"symbol":"ColumnParallelLinear","correct":"from megatron.core.tensor_parallel.layers import ColumnParallelLinear"},{"symbol":"RowParallelLinear","correct":"from megatron.core.tensor_parallel.layers import RowParallelLinear"},{"symbol":"TransformerBlock","correct":"from megatron.core.transformer.transformer_block import TransformerBlock"},{"symbol":"TransformerLayer","correct":"from megatron.core.transformer.transformer_layer import TransformerLayer"},{"note":"The `parallel_state` module, which tracks tensor/pipeline parallel ranks, is exposed by the package's top-level `__init__.py` and should be imported from `megatron.core` directly.","wrong":"import megatron.core.parallel_state","symbol":"parallel_state","correct":"from megatron.core import parallel_state"}],"quickstart":{"code":"import os\nimport torch\nimport torch.distributed as dist\nimport torch.nn.init as init\nfrom megatron.core import parallel_state\nfrom megatron.core.model_parallel_config import ModelParallelConfig\nfrom megatron.core.tensor_parallel.layers import ColumnParallelLinear\n\n# Minimal distributed setup for demonstration purposes.\n# In a real scenario, these env vars are set by a launcher (e.g., torchrun)\n# and dist.init_process_group is called once, globally.\nif not dist.is_initialized():\n    os.environ['MASTER_ADDR'] = os.environ.get('MASTER_ADDR', 'localhost')\n    os.environ['MASTER_PORT'] = os.environ.get('MASTER_PORT', '29500')\n    os.environ['RANK'] = os.environ.get('RANK', '0')\n    # WORLD_SIZE=1 allows a single-GPU test without a full distributed setup\n    os.environ['WORLD_SIZE'] = os.environ.get('WORLD_SIZE', '1')\n\n    if torch.cuda.is_available() and int(os.environ['WORLD_SIZE']) > 0:\n        try:\n            dist.init_process_group(backend='nccl', rank=int(os.environ['RANK']), world_size=int(os.environ['WORLD_SIZE']))\n            print(\"PyTorch distributed group initialized with NCCL.\")\n        except Exception as e:\n            print(f\"Warning: Could not initialize NCCL backend: {e}. Falling back to CPU/non-distributed.\")\n            os.environ['WORLD_SIZE'] = '1'\n            if dist.is_initialized():  # Destroy if a partial init succeeded\n                dist.destroy_process_group()\n    else:\n        print(\"Warning: CUDA not available or WORLD_SIZE=0. Skipping torch.distributed init.\")\n        os.environ['WORLD_SIZE'] = '1'\n\n# Set up Megatron-Core's model-parallel state.\n# This is crucial for Megatron-Core layers to correctly interpret parallel ranks.\nif dist.is_initialized():\n    parallel_state.initialize_model_parallel(tensor_model_parallel_size=dist.get_world_size())\nelse:\n    # Fallback for a CPU-only or non-distributed setup (effectively no parallelism)\n    parallel_state.set_tensor_model_parallel_world_size(1)\n    parallel_state.set_tensor_model_parallel_rank(0)\n\n# Define a simple parallel linear layer\nhidden_size = 128\noutput_size = 256\n\ntry:\n    # ColumnParallelLinear shards the weight matrix column-wise across\n    # tensor-parallel ranks, so each rank computes a slice of the output.\n    # gather_output=True all-gathers the slices so every rank sees the full output.\n    # use_cpu_initialization avoids needing a Megatron CUDA RNG seed for this demo.\n    linear_layer = ColumnParallelLinear(\n        input_size=hidden_size,\n        output_size=output_size,\n        config=ModelParallelConfig(use_cpu_initialization=True),\n        init_method=init.xavier_normal_,\n        gather_output=True\n    )\n    if torch.cuda.is_available():\n        linear_layer.cuda()\n\n    # Create a dummy input tensor. The last dimension must match hidden_size;\n    # batch and sequence length can vary.\n    input_tensor = torch.randn(2, 4, hidden_size)\n    if torch.cuda.is_available():\n        input_tensor = input_tensor.cuda()\n\n    # Forward pass. Megatron-Core linear layers return an (output, bias) tuple;\n    # the bias is None unless skip_bias_add=True.\n    output_tensor, output_bias = linear_layer(input_tensor)\n\n    print(\"\\nMegatron-Core ColumnParallelLinear initialized successfully.\")\n    print(f\"Input shape: {input_tensor.shape}\")\n    print(f\"Output shape (gathered): {output_tensor.shape}\")\n    print(f\"Output device: {output_tensor.device}\")\n\nexcept Exception as e:\n    print(f\"An error occurred during Megatron-Core layer execution: {e}\")\n\nfinally:\n    # Clean up model-parallel state and the process group if initialized\n    if dist.is_initialized():\n        parallel_state.destroy_model_parallel()\n        dist.destroy_process_group()\n","lang":"python","description":"This quickstart initializes a minimal distributed environment (required for Megatron-Core components) and instantiates a `ColumnParallelLinear` layer, showing the fundamental pattern for defining a parallelized model component. For actual distributed training, use a launcher such as `torchrun` (the successor to the deprecated `torch.distributed.launch`) to set up the environment variables and spawn one process per GPU."},"warnings":[{"fix":"Review the official changelog and documentation for `megatron.core.transformer_engine` for updated buffer handling methods.","message":"Megatron Core v0.15.0 introduced a 'new TE interface for user buffers'. Custom integrations or extensions that directly interfaced with lower-level buffers might require updates to conform to the new API.","severity":"breaking","affected_versions":">=0.15.0"},{"fix":"Ensure you have NVIDIA GPUs, CUDA drivers, and `torch` with CUDA support installed. Always initialize `torch.distributed` before using Megatron-Core components, typically via a launcher like `torchrun` or `deepspeed`.","message":"Megatron Core is fundamentally designed for distributed GPU training. Running without a proper PyTorch distributed setup (e.g., `torch.distributed.init_process_group`) and available CUDA devices will lead to errors or severely limited functionality.","severity":"gotcha","affected_versions":"All"},{"fix":"Consult the `megatron.core.inference_engine` documentation for updated asynchronous inference patterns and API changes.","message":"Megatron Core v0.14.0 added 'async support for DynamicInferenceEngine'. If you were using the inference engine in prior versions, this change might alter its behavior and require adapting existing inference pipelines.","severity":"breaking","affected_versions":">=0.14.0"},{"fix":"Ensure your environment meets all system requirements. Consider installing `megatron-core` with `[cuda]` extras and verifying that your CUDA and cuDNN installations are compatible with your PyTorch version.","message":"Megatron Core's performance heavily depends on specialized NVIDIA kernels (e.g., via Transformer Engine). Missing or incorrectly installed dependencies related to CUDA, cuDNN, or specialized libraries can lead to performance degradation or runtime errors.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Verify NVIDIA GPU presence, CUDA driver installation, and PyTorch's CUDA version compatibility (`torch.cuda.is_available()`, `torch.version.cuda`). Ensure necessary environment variables like `LD_LIBRARY_PATH` are set.","cause":"The system lacks a compatible NVIDIA GPU, or CUDA drivers are not correctly installed/configured for PyTorch to detect them.","error":"RuntimeError: CUDA error: no CUDA-capable device is detected"},{"fix":"Call `megatron.core.parallel_state.initialize_model_parallel()` (or the lower-level `set_tensor_model_parallel_world_size()` and `set_tensor_model_parallel_rank()`) after `torch.distributed.init_process_group()` has completed successfully.","cause":"Megatron Core's internal distributed configuration (e.g., tensor and pipeline parallelism) has not been initialized. This is crucial for its parallel layers.","error":"ValueError: tensor_model_parallel_world_size is not set"},{"fix":"Check network connectivity between nodes, verify that the `MASTER_ADDR` and `MASTER_PORT` environment variables are correctly set and reachable, and ensure the number of launched processes matches `WORLD_SIZE`.","cause":"NCCL (NVIDIA Collective Communications Library) initialization failed, often due to network issues, firewall restrictions, incorrect `MASTER_ADDR`/`MASTER_PORT`, or an insufficient number of processes for the `WORLD_SIZE`.","error":"torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1678278270412/work/torch/csrc/distributed/c10d/NCCLUtils.cpp:218, unhandled system error (aborting at rank 0)"},{"fix":"Consult the official Megatron-LM GitHub repository or documentation for the correct import path for your `megatron-core` version. Ensure your `megatron-core` installation is up-to-date with `pip install --upgrade megatron-core`.","cause":"The module or class name might have changed, been moved to a different submodule, or your `megatron-core` version is old/incompatible with the code you're running.","error":"ImportError: cannot import name 'TransformerBlock' from 'megatron.core.transformer'"}]}