Triton
Triton is a language and compiler for writing highly efficient custom deep-learning operations. It provides a Python-based programming environment for writing custom GPU kernels that can reach performance on par with hand-tuned CUDA, with higher productivity and flexibility than existing DSLs. Triton bridges the gap between high-level deep learning frameworks and low-level GPU programming. The current version is 3.6.0, with multiple major/minor releases per year.
Warnings
- breaking Triton 3.4.0 dropped support for Python 3.8. The minimum required Python version is now 3.10, and it supports up to 3.14 (i.e., <3.15). Ensure your Python environment meets these requirements.
- breaking In Triton 3.0.0, the behavior of `tl.constexpr` changed. You can no longer directly call non-Triton functions (e.g., `math.log2`) within a JIT function and assign their results to `tl.constexpr` variables. These values must be pre-computed outside the kernel or implemented with `triton.language` equivalents.
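A minimal sketch of the host-side workaround (kernel and argument names are illustrative, not from the Triton API): compute the constant in regular Python, then pass it into the kernel as a `tl.constexpr` argument.

```python
import math

# In Triton >= 3.0, calling a non-Triton function such as math.log2 inside a
# @triton.jit body and binding it to a tl.constexpr no longer works:
#
#     LOG2_BLOCK_SIZE: tl.constexpr = math.log2(BLOCK_SIZE)  # error inside a kernel
#
# Instead, compute the value on the host and pass it in as a constexpr argument:

BLOCK_SIZE = 1024
LOG2_BLOCK_SIZE = int(math.log2(BLOCK_SIZE))  # computed in regular Python

# The (hypothetical) kernel then declares the value as a compile-time constant:
#     def my_kernel(..., LOG2_BLOCK_SIZE: tl.constexpr): ...
# and the launch site passes LOG2_BLOCK_SIZE=LOG2_BLOCK_SIZE.
```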
- gotcha Triton primarily supports Linux with NVIDIA GPUs. Support for the Turing architecture (sm75, e.g., GTX 16xx/RTX 20xx) was dropped starting from Triton 3.3, so recent releases effectively require Ampere (Compute Capability 8.0) or newer; older releases supported Compute Capability 7.0+ (Volta or newer). AMD GPU support is in development. Official Windows and macOS binaries are not provided; WSL2 is the recommended workaround for Windows. An up-to-date NVIDIA driver is critical for PTX JIT compilation.
- gotcha Triton 3.5.0 introduced a bug that broke `sm103` (NVIDIA GB200/GB300) support. This was quickly patched in the 3.5.1 bug fix release.
- gotcha The official Triton library currently restricts `fp8` (float8) data type support to NVIDIA GPUs with compute capability >= 8.9 (e.g., RTX 40xx and newer). It is not officially supported on Ampere (RTX 30xx) or older architectures.
- gotcha Triton stores cache files in `~/.triton` by default. This can lead to conflicts or unexpected behavior when using different versions or forks of Triton, or when building self-contained applications. There are currently no official environment variables to override all cache-related directories.
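A small sketch for locating and clearing that cache (the directory's internal layout is an implementation detail and varies across versions; the helper name is our own):

```python
import os
import shutil

# Default on-disk cache location used by Triton.
cache_dir = os.path.join(os.path.expanduser("~"), ".triton")

def clear_triton_cache(path: str = cache_dir) -> None:
    """Delete the kernel cache; Triton rebuilds it lazily on the next launch."""
    if os.path.isdir(path):
        shutil.rmtree(path)
```

Clearing the cache is a useful reset when switching between Triton versions or forks that would otherwise trip over each other's cached artifacts.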
Install
-
pip install triton
Imports
- triton
import triton
- triton.language as tl
import triton.language as tl
- triton.jit
@triton.jit
Quickstart
import triton
import triton.language as tl
import torch
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Map program_id to a block of elements
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Create a mask to handle out-of-bounds accesses
    mask = offsets < n_elements
    # Load data from memory
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Perform addition
    output = x + y
    # Write back to memory
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    assert x.is_cuda and y.is_cuda
    output = torch.empty_like(x)
    n_elements = x.numel()
    # BLOCK_SIZE is a compile-time constant, so it can't depend on `n_elements`.
    # 1024 is a reasonable default; tune it for your hardware if needed.
    BLOCK_SIZE = 1024
    # Number of programs (blocks) to launch: one per BLOCK_SIZE-sized chunk
    grid = lambda META: (triton.cdiv(n_elements, META['BLOCK_SIZE']),)
    # Launch the kernel
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=BLOCK_SIZE)
    return output

if __name__ == "__main__":
    # Example usage with PyTorch tensors
    size = 4096
    x = torch.randn(size, device='cuda')
    y = torch.randn(size, device='cuda')
    output_triton = add(x, y)
    output_torch = x + y
    print(f"Triton output matches PyTorch: {torch.allclose(output_triton, output_torch)}")
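The launch grid above sizes itself with `triton.cdiv`, i.e. ceiling division, so a trailing partial block still gets its own program (whose out-of-range lanes are guarded by the mask inside the kernel). A plain-Python equivalent of that helper:

```python
def cdiv(n: int, d: int) -> int:
    # Ceiling division: number of d-sized programs needed to cover n elements.
    return (n + d - 1) // d

# 4096 elements with BLOCK_SIZE=1024 -> exactly 4 programs.
assert cdiv(4096, 1024) == 4
# 4097 elements -> 5 programs; the 5th handles one element, the mask covers the rest.
assert cdiv(4097, 1024) == 5
```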