Triton for Windows
Triton-windows is a community-maintained fork of the Triton language and compiler, tailored to support deep-learning workloads on Windows. It lets you define and launch custom GPU kernels from Python, enabling high-performance computing on Windows machines. The current version is 3.6.0.post26; releases closely track the upstream Triton project and often include Windows-specific bug fixes and performance improvements.
Warnings
- breaking Each major version of Triton-windows has strict compatibility requirements with specific PyTorch versions. For example, Triton 3.6 requires PyTorch >= 2.10, Triton 3.5 requires PyTorch >= 2.9, and Triton 3.4 requires PyTorch >= 2.8. Installing a mismatched version will lead to runtime errors or incorrect behavior.
- gotcha To prevent automatic updates of `triton-windows` from breaking compatibility with your installed PyTorch (due to the strict versioning explained above), it's highly recommended to pin the `triton-windows` version during installation.
- gotcha Windows' path length limit (260 characters) can cause issues with Triton's cache directory, leading to compilation failures or 'file not found' errors. This was a common problem in older versions.
- gotcha While initial support for AMD GPUs (with TheRock) was introduced in `3.5.1-windows.post23`, and further fixes in `3.6.0-windows.post25`, AMD GPU support is still evolving. Users might encounter specific bugs or limitations not present on NVIDIA GPUs.
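A common workaround for the path-length issue above is to point Triton's cache at a short directory before the library is first imported. A minimal sketch, assuming your build honors the standard `TRITON_CACHE_DIR` environment variable and that `C:\triton-cache` is a writable path on your machine:

```python
import os

# Must be set BEFORE triton is first imported; otherwise the default
# cache under the (potentially long) user-profile path is used.
os.environ["TRITON_CACHE_DIR"] = r"C:\triton-cache"

# import triton  # imported afterwards, so it picks up the short cache path
```

Alternatively, enabling Windows long-path support system-wide also avoids the 260-character limit.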
Install
- pip
pip install -U "triton-windows<3.7"
Imports
- triton
import triton
- triton.language
import triton.language as tl
- triton.jit
from triton import jit
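A quick way to confirm the package imported correctly and check which build you have (a sketch; the `importlib.util.find_spec` check merely avoids a hard failure when triton is absent):

```python
import importlib.util

def triton_version():
    """Return the installed Triton version string, or None if not installed."""
    if importlib.util.find_spec("triton") is None:
        return None
    import triton
    return triton.__version__

print(triton_version() or "triton is not installed")
```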
Quickstart
import torch
import triton
import triton.language as tl
# Define a simple Triton kernel for vector addition
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    # Mask out-of-bounds lanes in the final block
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)
def add(x: torch.Tensor, y: torch.Tensor):
    # Ensure inputs are contiguous, on a CUDA device, and the same shape
    assert x.is_cuda and y.is_cuda, "Inputs must be on a CUDA device"
    assert x.is_contiguous() and y.is_contiguous(), "Inputs must be contiguous"
    assert x.shape == y.shape
    n_elements = x.numel()
    # Allocate output tensor
    output = torch.empty_like(x)
    # Grid size: one program per BLOCK_SIZE-element chunk
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    # Launch the kernel
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
if __name__ == "__main__":
    if not torch.cuda.is_available():
        print("CUDA not available. Triton requires a GPU.")
    else:
        print("CUDA is available, running Triton example...")
        size = 2**20  # ~1 million elements
        x = torch.rand(size, device='cuda')
        y = torch.rand(size, device='cuda')
        output = add(x, y)
        # Verify correctness against PyTorch
        expected_output = x + y
        assert torch.allclose(output, expected_output, atol=1e-5), "Triton output mismatch!"
        print("Triton vector addition successful!")
        print("First 5 elements of Triton output:", output[:5])
        print("First 5 elements of PyTorch output:", expected_output[:5])
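The grid in the example is computed with `triton.cdiv` (ceiling division), which guarantees the last, possibly partial, block is still covered by a program instance. A plain-Python sketch of the same launch arithmetic:

```python
def cdiv(n, d):
    # Ceiling division: smallest number of d-sized blocks covering n elements
    return (n + d - 1) // d

n_elements = 2**20
BLOCK_SIZE = 1024
num_blocks = cdiv(n_elements, BLOCK_SIZE)
print(num_blocks)  # 1024 programs, each handling one 1024-element chunk
```

This is why the kernel needs the `offsets < n_elements` mask: when `n_elements` is not a multiple of `BLOCK_SIZE`, the final program's trailing lanes fall past the end of the tensors and must not be loaded or stored.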