Triton-Ascend

3.2.0 · active · verified Mon Apr 13

Triton-Ascend is a compilation framework for the Ascend platform that enables Triton kernels to run efficiently on Ascend NPUs. It lowers the barrier to operator development by automatically handling memory allocation, data transfer, and computation according to the hardware's characteristics. The project is under active development: version 3.2.0 has been released, and there are plans to align with Triton 3.5.x in 2026.

Warnings

Install
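This section is empty in this snapshot. Based on the version pins quoted in the quickstart comments below, a typical setup might look like the following; the `triton-ascend` package name is an assumption, so check the project's own release notes for the wheel matching your CANN version.

```shell
# Source the CANN toolkit environment first (path varies by install)
source /path/to/Ascend/ascend-toolkit/set_env.sh

# CPU build of torch plus the matching Ascend backend (versions from the quickstart)
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu
pip install torch_npu==2.6.0

# Triton-Ascend itself (package name assumed; verify against the project docs)
pip install triton-ascend
```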

Imports
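The quickstart below relies on the following imports. Note that `torch_npu` must be imported explicitly (the quickstart does this in its `__main__` guard) even though the kernel code only references `torch`, `triton`, and `tl`, because importing it is what registers the `'npu'` device with PyTorch.

```python
import torch
import torch_npu  # registers the 'npu' device; must match your CANN/torch version
import triton
import triton.language as tl
```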

Quickstart

This example demonstrates a basic vector addition with a Triton-Ascend kernel: it initializes two tensors on the Ascend NPU, defines a simple JIT-compiled kernel, launches it, and verifies the result against the equivalent PyTorch operation. It requires Ascend CANN and `torch_npu` to be correctly installed and configured in the environment.

import os
import torch
import triton
import triton.language as tl

# The Ascend environment must be configured before running, typically via:
#   source /path/to/Ascend/ascend-toolkit/set_env.sh
# torch and torch_npu must match your CANN version, e.g.:
#   pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu
#   pip install torch_npu==2.6.0
# Importing torch_npu (done in the __main__ guard below) registers the 'npu'
# device with PyTorch; the kernel definition itself only needs torch and triton.

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def main():
    if not torch.npu.is_available():
        print("Ascend NPU not available. Please ensure CANN and torch_npu are correctly installed and configured.")
        return

    print(f"Using NPU device: {torch.npu.get_device_name(0)}")
    N = 1024 * 128
    # Allocate memory on NPU
    x = torch.randn(N, device='npu', dtype=torch.float32)
    y = torch.randn(N, device='npu', dtype=torch.float32)
    output = torch.empty_like(x)  # inherits dtype and the 'npu' device from x

    # Define the grid and block size
    BLOCK_SIZE = 1024
    grid = lambda META: (triton.cdiv(N, META['BLOCK_SIZE']),)

    # Launch the kernel
    print("Launching Triton-Ascend kernel...")
    add_kernel[grid](x, y, output, N, BLOCK_SIZE=BLOCK_SIZE)

    # Verify results
    torch_output = x + y
    assert torch.allclose(output, torch_output, atol=1e-5, rtol=1e-5)
    print("Kernel execution successful and results verified!")

if __name__ == '__main__':
    # It's crucial to set up the Ascend CANN environment variables before running.
    # Example: os.environ['ASCEND_TOOLKIT_PATH'] = '/usr/local/Ascend/ascend-toolkit'
    # Or ensure your shell environment has sourced the set_env.sh script.
    try:
        import torch_npu
        main()
    except ImportError:
        print("torch_npu not found. Please install it with `pip install torch_npu` (ensure compatibility with your Ascend CANN version).")
    except Exception as e:
        print(f"An error occurred: {e}")
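To make the kernel's indexing concrete, here is a pure-Python sketch (no NPU required) of what one program instance computes: `pid` selects a block of `BLOCK_SIZE` consecutive offsets, and the mask keeps out-of-range offsets from being loaded or stored. The helper names (`cdiv`, `add_block`) are illustrative, not part of the Triton-Ascend API.

```python
def cdiv(a, b):
    # Ceiling division, matching the semantics of triton.cdiv used in the grid
    return (a + b - 1) // b

def add_block(x, y, pid, block_size, n_elements):
    # CPU model of one add_kernel program instance: compute the block's
    # offsets and apply the mask `offsets < n_elements`.
    block_start = pid * block_size
    out = {}
    for i in range(block_size):
        off = block_start + i
        if off < n_elements:  # masked load/store: skip out-of-range offsets
            out[off] = x[off] + y[off]
    return out

x = list(range(10))
y = [2 * v for v in x]
n, block = 10, 4
result = {}
for pid in range(cdiv(n, block)):  # 3 program instances; the last is partial
    result.update(add_block(x, y, pid, block, n))
```

The last program instance covers offsets 8..11, but the mask discards 10 and 11, which is exactly why the launch can round the grid size up with `triton.cdiv` without reading or writing past the end of the tensors.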
