Triton-Ascend
Triton-Ascend is a compilation framework for the Ascend platform that enables Triton code to run efficiently on Ascend NPUs. It lowers the barrier to operator development by automatically handling memory allocation, data transfer, and computation scheduling according to the hardware's characteristics. The project is actively developed: version 3.2.0 has been released, and alignment with Triton 3.5.x is planned for 2026.
Warnings
- breaking Upstream Triton 3.5.x introduced significant Python API refactoring (e.g., to `semantic.py`) and changes in LLVM/MLIR APIs (e.g., `bufferization::ToMemrefOp` to `bufferization::ToBufferOp`, stride/offset API migration). While Triton-Ascend plans to align with 3.5.x, these changes necessitate adaptations in backend code and may affect custom Triton operators written for older versions.
- gotcha Community Triton and Triton-Ascend cannot coexist in the same environment. Installing software that implicitly depends on and pulls in community Triton will overwrite your Triton-Ascend installation, leading to unexpected behavior or errors.
- gotcha The Ascend NPU's `coreDim` parameter has a limit (UINT16_MAX, 65535). For large-scale data, a naive grid division might exceed this limit, preventing kernel launch or causing errors.
- gotcha Triton compilation can fail on Ascend NPUs if the `--target` flag is not correctly recognized (e.g., `--target=Ascend310P3`), resulting in a `Cannot find option named 'Ascend310P3!'` error and a fatal `EngineDeadError`. This has been observed with vLLM Ascend integration.
- gotcha Triton-Ascend's backend may not compile 2D masked `tl.store` operations, leading to compilation errors (e.g., at the `ttir_to_linalg` stage).
- gotcha Migrating Triton operators from NVIDIA GPUs to Ascend NPUs requires significant architectural considerations, including shifting from GPU's 'logical grid flexibility' to Ascend's 'physical core group binding', enforcing 32-byte or 512-byte memory alignment, and removing GPU-specific synchronization APIs.
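The coreDim limit and the alignment requirements above can be handled on the host side before launch. A minimal sketch, assuming the helper names `capped_grid` and `align_up` (they are illustrative, not part of the Triton-Ascend API):

```python
# Host-side helpers for two Ascend constraints noted above.
# Illustrative sketches only; not part of the Triton-Ascend API.

CORE_DIM_MAX = 65535  # UINT16_MAX: Ascend's coreDim launch limit

def capped_grid(n_elements: int, block_size: int) -> int:
    """Grid size capped at coreDim. With a capped grid, each program must
    loop over multiple blocks inside the kernel (a grid-stride loop)."""
    n_blocks = -(-n_elements // block_size)  # ceil division, like triton.cdiv
    return min(n_blocks, CORE_DIM_MAX)

def align_up(n_bytes: int, alignment: int = 32) -> int:
    """Round a byte count up to a 32- or 512-byte boundary."""
    return (n_bytes + alignment - 1) // alignment * alignment
```

With a capped grid, the kernel iterates: program `pid` handles blocks `pid`, `pid + grid`, `pid + 2*grid`, and so on until `n_elements` is covered.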
Install
- pip install triton-ascend
Imports
- triton
import triton
- triton.language
import triton.language as tl
Quickstart
import os
import torch
import triton
import triton.language as tl

# Ensure the Ascend NPU environment is set up, typically via
# `source /path/to/Ascend/ascend-toolkit/set_env.sh`.
# Install torch and torch_npu versions compatible with your CANN release, e.g.:
#   pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu
#   pip install torch_npu==2.6.0
# Importing torch_npu (done below) registers the 'npu' device with PyTorch.
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail when n_elements % BLOCK_SIZE != 0
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)
def main():
    if not torch.npu.is_available():
        print("Ascend NPU not available. Please ensure CANN and torch_npu are correctly installed and configured.")
        return
    print(f"Using NPU device: {torch.npu.get_device_name(0)}")

    N = 1024 * 128
    # Allocate memory on the NPU
    x = torch.randn(N, device='npu', dtype=torch.float32)
    y = torch.randn(N, device='npu', dtype=torch.float32)
    output = torch.empty_like(x)  # inherits device and dtype from x

    # Define the grid: one program per BLOCK_SIZE-sized chunk
    BLOCK_SIZE = 1024
    grid = lambda META: (triton.cdiv(N, META['BLOCK_SIZE']),)

    # Launch the kernel
    print("Launching Triton-Ascend kernel...")
    add_kernel[grid](x, y, output, N, BLOCK_SIZE=BLOCK_SIZE)

    # Verify against the PyTorch reference
    torch_output = x + y
    assert torch.allclose(output, torch_output, atol=1e-5, rtol=1e-5)
    print("Kernel execution successful and results verified!")
if __name__ == '__main__':
    # Set up the Ascend CANN environment variables before running, e.g.
    #   os.environ['ASCEND_TOOLKIT_PATH'] = '/usr/local/Ascend/ascend-toolkit'
    # or ensure your shell has sourced the set_env.sh script.
    try:
        import torch_npu  # registers the 'npu' device with PyTorch
        main()
    except ImportError:
        print("torch_npu not found. Please install it with `pip install torch_npu` "
              "(ensure compatibility with your Ascend CANN version).")
    except Exception as e:
        print(f"An error occurred: {e}")
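Given the 2D masked `tl.store` limitation noted in Warnings, a common workaround is to flatten a 2D tile into 1D row-major offsets and store with a 1D mask. The helper below is a pure-Python illustration of the offset math (the name `flat_offsets` is hypothetical; inside a kernel the same arithmetic would be done with `tl.arange`):

```python
def flat_offsets(row_start, col_start, rows, cols, row_stride):
    """Row-major 1D offsets for a rows x cols tile starting at
    (row_start, col_start) in a matrix with the given row stride.
    A kernel would build these from tl.arange over rows and cols."""
    return [(row_start + r) * row_stride + (col_start + c)
            for r in range(rows)
            for c in range(cols)]
```

Each program can then issue a single 1D `tl.store` over these offsets, with a 1D mask covering both the row and column bounds.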