Triton
3.6.0 · verified Tue May 12 · auth: no · python · install: draft · quickstart: stale
Triton is a language and compiler for writing highly efficient custom Deep Learning operations. It provides a Python-based programming environment for writing custom GPU kernels that can achieve performance on par with hand-tuned CUDA, but with higher productivity and flexibility than other existing DSLs. Triton aims to bridge the gap between high-level deep learning frameworks and low-level GPU programming. The current version is 3.6.0, with frequent releases (multiple major/minor releases per year).
pip install triton
Common errors
error ModuleNotFoundError: No module named 'triton' ↓
cause The Triton library is not installed in the current Python environment or is not accessible via the Python path.
fix
pip install triton
error triton.compiler.code_generator.CompilationError: unsupported scalar type: i64 ↓
cause Triton kernels have limited support for `i64` or `f64` types, often preferring `i32` or `f32` for performance and compatibility across different hardware architectures.
fix
Convert i64 or f64 inputs to i32 or f32 within the kernel using tl.cast where possible, or ensure your specific GPU and Triton version support the desired type.
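A minimal sketch of the down-cast approach (not from the Triton docs); the helper name is illustrative and it assumes 32-bit precision is acceptable for your data:
import torch

def as_kernel_friendly(t: torch.Tensor) -> torch.Tensor:
    # Down-cast 64-bit tensors on the host before launching the kernel.
    if t.dtype == torch.int64:
        return t.to(torch.int32)    # assumes values fit in int32
    if t.dtype == torch.float64:
        return t.to(torch.float32)  # assumes f32 precision is enough
    return t

# Inside a @triton.jit kernel, a loaded value can likewise be cast with
# x.to(tl.float32) (or tl.cast(x, tl.float32) in recent releases).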
error triton.runtime.autotune.OutOfResources: Failed to launch with config ↓
cause The chosen kernel configuration (e.g., block size, number of warps, shared memory usage) exceeds the available resources or limits of the GPU, or the auto-tuner couldn't find a valid launch configuration.
fix
Adjust the kernel launch configuration (e.g., reduce BLOCK_SIZE, num_warps, or num_stages, which lowers register and shared-memory pressure) or give the auto-tuner a smaller, more feasible range of configurations to explore.
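A sketch of constraining the search space with triton.autotune and triton.Config; the kernel and the specific block sizes and warp counts are illustrative, not tuned recommendations:
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_SIZE': 256}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 512}, num_warps=4),
        triton.Config({'BLOCK_SIZE': 1024}, num_warps=8),
    ],
    key=['n_elements'],  # re-tune when this argument's value changes
)
@triton.jit
def scaled_copy_kernel(x_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * 2.0, mask=mask)

# Launch without passing BLOCK_SIZE; the autotuner supplies it:
# grid = lambda META: (triton.cdiv(n_elements, META['BLOCK_SIZE']),)
# scaled_copy_kernel[grid](x, out, n_elements)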
error AttributeError: 'NinjaBuildExtension' object has no attribute 'get_ext_filename' ↓
cause This error typically occurs during Triton installation due to an incompatibility between Triton's build system and an outdated `setuptools` version.
fix
pip install --upgrade setuptools
Warnings
breaking Triton 3.4.0 dropped support for Python 3.8. The minimum required Python version is now 3.10, and it supports up to 3.14 (i.e., <3.15). Ensure your Python environment meets these requirements. ↓
fix Upgrade your Python version to 3.10 or newer (but less than 3.15).
breaking In Triton 3.0.0, the behavior of `tl.constexpr` changed. You can no longer directly call non-Triton functions (e.g., `math.log2`) within a JIT function and assign their results to `tl.constexpr` variables. These values must be pre-computed outside the kernel or implemented with `triton.language` equivalents. ↓
fix Pre-compute values outside the JIT-compiled kernel or use `triton.language` math functions where available. For example, assign `log2e: tl.constexpr = 1.4426950408889634` instead of `log2e: tl.constexpr = math.log2(math.e)`.
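A minimal sketch of the pre-compute pattern; the kernel name and the exp2-based rewrite are illustrative:
import math
import triton
import triton.language as tl

LOG2E = math.log2(math.e)  # computed in regular Python, outside the JIT-compiled kernel

@triton.jit
def exp_kernel(x_ptr, out_ptr, n_elements, log2e: tl.constexpr, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    # exp(x) expressed as exp2(x * log2(e)); tl.exp would also work here.
    tl.store(out_ptr + offsets, tl.exp2(x * log2e), mask=mask)

# Launch with the pre-computed constant:
# exp_kernel[grid](x, out, n_elements, log2e=LOG2E, BLOCK_SIZE=1024)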
gotcha Triton primarily supports Linux with NVIDIA GPUs (Compute Capability 7.0 or higher, Volta generation or newer). AMD GPU support is in development. Official Windows and macOS binaries are not provided; WSL2 is the recommended workaround for Windows. An up-to-date NVIDIA driver is critical for PTX JIT compilation. Support for NVIDIA GPUs with Turing architecture (sm75, e.g., GTX 16xx/RTX 20xx) was dropped starting from Triton 3.3. ↓
fix Ensure you are running on a supported Linux environment with a compatible NVIDIA GPU and the latest drivers. For Windows, use WSL2. Verify your GPU's compute capability if experiencing issues on older hardware.
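A quick way to check the running GPU against these requirements from PyTorch; the thresholds in the comment are the ones quoted above:
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"Compute capability: {major}.{minor}")
    # Triton generally needs >= 7.0; see the surrounding notes for
    # version-specific exceptions (sm75 dropped from 3.3, fp8 needs >= 8.9).
else:
    print("No CUDA device visible to PyTorch")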
gotcha Triton 3.5.0 introduced a bug that broke `sm103` (NVIDIA GB200/GB300) support. This was quickly patched in the 3.5.1 bug fix release. ↓
fix If targeting NVIDIA GB200/GB300 GPUs, ensure you are using Triton 3.5.1 or a later version.
gotcha The official Triton library currently restricts `fp8` (float8) data type support to NVIDIA GPUs with compute capability >= 8.9 (e.g., RTX 40xx and newer). It is not officially supported on Ampere (RTX 30xx) or older architectures. ↓
fix Use a supported GPU for `fp8` operations or consider using other data types. Some community forks or specific `triton-windows` builds might offer extended `fp8` support on older hardware.
gotcha Triton stores cache files in `~/.triton` by default. This can lead to conflicts or unexpected behavior when using different versions or forks of Triton, or when building self-contained applications. There are currently no official environment variables to override all cache-related directories. ↓
fix Monitor the `~/.triton` directory for unexpected files. For specific control over some aspects, environment variables like `TRITON_HOME` can change the root of the cache directory. Consider contributing to add more granular control over cache locations.
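A sketch of redirecting the cache root, assuming a Triton version that honors `TRITON_HOME` (and, in many versions, `TRITON_CACHE_DIR`); the paths are illustrative and the variables must be set before `import triton`:
import os

os.environ["TRITON_HOME"] = "/tmp/my_app_triton"  # illustrative path; moves the ~/.triton root
os.environ.setdefault("TRITON_CACHE_DIR", "/tmp/my_app_triton/cache")  # honored by many versions

import triton  # imported only after the environment variables are set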
gotcha The `torch` module is required by Triton's tutorials and most real workloads; a `ModuleNotFoundError: No module named 'torch'` means PyTorch is not installed in the environment. ↓
fix Install PyTorch in your environment using `pip install torch` (or specific instructions from pytorch.org based on your hardware and CUDA requirements).
Install compatibility draft last tested: 2026-05-12
python  os / libc      status       wheel  install  import  disk
3.9     alpine (musl)  build_error  -      -        -       -
3.9     alpine (musl)               -      -        -       -
3.9     slim (glibc)                wheel  6.0s     0.29s   558M
3.9     slim (glibc)                -      -        0.30s   558M
3.10    alpine (musl)  build_error  -      -        -       -
3.10    alpine (musl)               -      -        -       -
3.10    slim (glibc)                wheel  6.8s     0.33s   716M
3.10    slim (glibc)                -      -        0.25s   658M
3.11    alpine (musl)  build_error  -      -        -       -
3.11    alpine (musl)               -      -        -       -
3.11    slim (glibc)                wheel  6.5s     0.68s   720M
3.11    slim (glibc)                -      -        0.49s   661M
3.12    alpine (musl)  build_error  -      -        -       -
3.12    alpine (musl)               -      -        -       -
3.12    slim (glibc)                wheel  6.5s     0.45s   711M
3.12    slim (glibc)                -      -        0.39s   653M
3.13    alpine (musl)  build_error  -      -        -       -
3.13    alpine (musl)               -      -        -       -
3.13    slim (glibc)                wheel  6.2s     0.41s   711M
3.13    slim (glibc)                -      -        0.34s   653M
Imports
import triton
import triton.language as tl
@triton.jit
Quickstart stale last tested: 2026-04-23
import triton
import triton.language as tl
import torch
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Map program_id to a block of elements
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Create a mask to handle out-of-bounds accesses
    mask = offsets < n_elements
    # Load data from memory
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # Perform addition
    output = x + y
    # Write back to memory
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = x.numel()
    # The block size is a compile-time constant, so we can't use `n_elements`
    # directly. Instead, we use a heuristic to choose a good block size.
    BLOCK_SIZE = 1024  # Or adjust based on your needs
    # Number of programs (blocks) to launch
    grid = lambda META: (triton.cdiv(n_elements, META['BLOCK_SIZE']),)
    # Launch the kernel
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=BLOCK_SIZE)
    return output

if __name__ == "__main__":
    # Example usage with PyTorch tensors
    size = 4096
    x = torch.randn(size, device='cuda')
    y = torch.randn(size, device='cuda')
    output_triton = add(x, y)
    output_torch = x + y
    print(f"Triton output matches PyTorch: {torch.allclose(output_triton, output_torch)}")