{"id":5528,"library":"triton-ascend","title":"Triton-Ascend","description":"Triton-Ascend is a compilation framework built for the Ascend platform, designed to enable Triton code to run efficiently on Ascend NPUs. It aims to reduce operator development difficulty by automatically handling memory allocation, data transfer, and computation based on hardware characteristics. The project is actively developed, with version 3.2.0 released, and plans to align with Triton 3.5.x in 2026.","status":"active","version":"3.2.0","language":"en","source_language":"en","source_url":"https://github.com/triton-lang/triton-ascend","tags":["deep learning","compiler","hardware acceleration","NPU","Ascend","Triton","Huawei"],"install":[{"cmd":"pip install triton-ascend","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required for PyTorch integration and NPU device support. Specific versions of torch and torch_npu are usually required for compatibility, e.g., torch_npu==2.7.1 for Triton-Ascend 3.2.0.","package":"torch_npu","optional":false},{"reason":"A fundamental software stack for Huawei Ascend AI processors, essential for Triton-Ascend to function. Version 8.5.0 is recommended.","package":"Ascend CANN Community Edition","optional":false},{"reason":"Requires Python versions 3.9 to 3.11.","package":"Python","optional":false},{"reason":"System dependency, requires GCC >= 9.4.0.","package":"GCC","optional":false},{"reason":"System dependency, requires GLIBC >= 2.27.","package":"GLIBC","optional":false}],"imports":[{"symbol":"triton","correct":"import triton"},{"symbol":"triton.language","correct":"import triton.language as tl"}],"quickstart":{"code":"import os\nimport torch\nimport triton\nimport triton.language as tl\n\n# Ensure Ascend NPU environment is set up. 
This is usually done via `source /path/to/Ascend/ascend-toolkit/set_env.sh`\n# For demonstration, we assume the 'npu' device is available through torch_npu.\n# You might need to install torch and torch_npu builds that match each other and your CANN version, e.g.:\n# pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu\n# pip install torch_npu==2.6.0\n# (The versions above are illustrative; use the pair required by your Triton-Ascend release.)\n# `torch_npu` is imported explicitly in the `__main__` block below, which registers the 'npu' device with PyTorch.\n\n@triton.jit\ndef add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):\n    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors\n    pid = tl.program_id(axis=0)\n    block_start = pid * BLOCK_SIZE\n    offsets = block_start + tl.arange(0, BLOCK_SIZE)\n    # Mask off out-of-range lanes in the last block\n    mask = offsets < n_elements\n    x = tl.load(x_ptr + offsets, mask=mask)\n    y = tl.load(y_ptr + offsets, mask=mask)\n    output = x + y\n    tl.store(output_ptr + offsets, output, mask=mask)\n\ndef main():\n    if not torch.npu.is_available():\n        print(\"Ascend NPU not available. Please ensure CANN and torch_npu are correctly installed and configured.\")\n        return\n\n    print(f\"Using NPU device: {torch.npu.get_device_name(0)}\")\n    N = 1024 * 128\n    # Allocate memory on NPU\n    x = torch.randn(N, device='npu', dtype=torch.float32)\n    y = torch.randn(N, device='npu', dtype=torch.float32)\n    output = torch.empty_like(x)  # inherits device and dtype from x\n\n    # Define the block size and a 1D grid with one program per block\n    BLOCK_SIZE = 1024\n    grid = lambda META: (triton.cdiv(N, META['BLOCK_SIZE']),)\n\n    # Launch the kernel\n    print(\"Launching Triton-Ascend kernel...\")\n    add_kernel[grid](x, y, output, N, BLOCK_SIZE=BLOCK_SIZE)\n\n    # Verify results\n    torch_output = x + y\n    assert torch.allclose(output, torch_output, atol=1e-5, rtol=1e-5)\n    print(\"Kernel execution successful and results verified!\")\n\nif __name__ == '__main__':\n    # It's crucial to set up the Ascend CANN environment variables before running.\n    # Example: os.environ['ASCEND_TOOLKIT_PATH'] = '/usr/local/Ascend/ascend-toolkit'\n    # Or 
ensure your shell environment has sourced the set_env.sh script.\n    try:\n        import torch_npu\n        main()\n    except ImportError:\n        print(\"torch_npu not found. Please install it with `pip install torch_npu` (ensure compatibility with your Ascend CANN version).\")\n    except Exception as e:\n        print(f\"An error occurred: {e}\")","lang":"python","description":"This example demonstrates a basic vector addition using a Triton-Ascend kernel. It initializes two tensors on the Ascend NPU, defines a simple JIT-compiled kernel, launches it, and then verifies the results against a standard PyTorch operation. Requires Ascend CANN and `torch_npu` to be correctly installed and configured in the environment."},"warnings":[{"fix":"Review migration guides for Triton-Ascend when upgrading to versions based on Triton 3.5.x or newer. Adapt custom operators to the new API patterns, particularly concerning semantic functions and MLIR bufferization.","message":"Upstream Triton 3.5.x introduced significant Python API refactoring (e.g., to `semantic.py`) and changes in LLVM/MLIR APIs (e.g., `bufferization::ToMemrefOp` to `bufferization::ToBufferOp`, stride/offset API migration). While Triton-Ascend plans to align with 3.5.x, these changes necessitate adaptations in backend code and may affect custom Triton operators written for older versions.","severity":"breaking","affected_versions":">=3.5.x (upstream Triton), relevant for future Triton-Ascend versions aiming to align"},{"fix":"Always uninstall any community Triton installations before installing Triton-Ascend. When managing environments, prioritize Triton-Ascend and be mindful of dependencies that might pull in incompatible Triton versions.","message":"Community Triton and Triton-Ascend cannot coexist in the same environment. 
Installing other software that implicitly depends on and installs 'community Triton' will overwrite your Triton-Ascend installation, leading to unexpected behavior or errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Adjust the `BLOCK_SIZE` in your Triton kernel to reduce the number of required cores (`coreDim = ceil(N / BLOCK_SIZE)`), ensuring `coreDim` remains within the 65535 limit. Use `triton.next_power_of_2(triton.cdiv(N, 65535))` to find a safe minimum `BLOCK_SIZE`.","message":"The Ascend NPU's `coreDim` parameter has a limit (UINT16_MAX, 65535). For large-scale data, a naive grid division might exceed this limit, preventing kernel launch or causing errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure that the Ascend CANN environment and Triton-Ascend installation correctly register the NPU target options. Check documentation for specific `bishengir-compile` options and ensure compatibility between Triton-Ascend and the NPU driver version. Report the issue if it persists with officially supported configurations.","message":"Triton compilation can fail on Ascend NPUs if the `--target` flag is not correctly recognized (e.g., `--target=Ascend310P3`), resulting in a `Cannot find option named 'Ascend310P3!'` error and a fatal `EngineDeadError`. 
This has been observed with vLLM Ascend integration.","severity":"gotcha","affected_versions":"Specific versions, notably when integrating with frameworks like vLLM where target flags are implicitly passed."},{"fix":"If encountering issues with 2D masked `tl.store`, refactor the kernel to use a row-wise or 1D masked `tl.store` pattern as a workaround, or implement equivalent functionality using supported operations.","message":"Triton-Ascend's backend may not compile 2D masked `tl.store` operations, leading to compilation errors (e.g., at the `ttir_to_linalg` stage).","severity":"gotcha","affected_versions":"All versions up to 3.2.0, potentially fixed in future releases."},{"fix":"Thoroughly review the Triton-Ascend migration guide. Adapt grid dimensions to match physical NPU core counts, ensure proper memory alignment (e.g., `32-byte` for VV, `512-byte` for CV scenarios), and replace any GPU-specific synchronization with Ascend-compatible mechanisms.","message":"Migrating Triton operators from NVIDIA GPUs to Ascend NPUs requires significant architectural considerations, including shifting from GPU's 'logical grid flexibility' to Ascend's 'physical core group binding', enforcing 32-byte or 512-byte memory alignment, and removing GPU-specific synchronization APIs.","severity":"gotcha","affected_versions":"All versions, for users migrating existing Triton code."}],"env_vars":null,"last_verified":"2026-04-13T00:00:00.000Z","next_check":"2026-07-12T00:00:00.000Z"}