Triton for Windows

3.6.0.post26 · active · verified Wed Apr 15

Triton-windows is a community-maintained fork of the Triton language and compiler, tailored to support deep-learning workloads on Windows. It lets you define and launch custom GPU kernels from Python, bringing Triton's high-performance kernel compilation to Windows machines. The current version is 3.6.0.post26; releases closely track the upstream Triton project and often include Windows-specific bug fixes and performance enhancements.

Warnings

Install
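This section is empty on the page. As a sketch: the package is distributed on PyPI under the name `triton-windows` (matching the fork's name above), so a typical install, assuming a working PyTorch/CUDA setup is already present, would look like:

```shell
# Install (or upgrade) the Windows fork of Triton from PyPI
pip install -U triton-windows
```

Note that `triton-windows` replaces the upstream `triton` package; both should not be installed in the same environment.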

Imports

Quickstart

This quickstart demonstrates how to define and launch a simple vector addition kernel using Triton. It highlights the use of `triton.jit` for kernel definition, `triton.language` for GPU operations, and integrating with PyTorch tensors. Ensure you have a CUDA-enabled GPU and PyTorch installed.

import torch
import triton
import triton.language as tl

# Define a simple Triton kernel for vector addition
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    # Ensure inputs are contiguous and on a CUDA device
    assert x.is_cuda and y.is_cuda, "Inputs must be on a CUDA device"
    assert x.is_contiguous() and y.is_contiguous(), "Inputs must be contiguous"
    assert x.shape == y.shape, "Input shapes must match"
    n_elements = x.numel()

    # Allocate output tensor
    output = torch.empty_like(x)

    # Calculate grid dimension based on BLOCK_SIZE
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)

    # Launch the kernel
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

if __name__ == "__main__":
    if not torch.cuda.is_available():
        print("CUDA not available. Triton requires a GPU.")
    else:
        print("CUDA is available, running Triton example...")
        size = 2**20 # 1,048,576 elements
        x = torch.rand(size, device='cuda')
        y = torch.rand(size, device='cuda')
        output = add(x, y)
        
        # Verify correctness
        expected_output = x + y
        assert torch.allclose(output, expected_output, atol=1e-5), "Triton output mismatch!"
        print("Triton vector addition successful!")
        print("First 5 elements of Triton output:", output[:5])
        print("First 5 elements of PyTorch output:", expected_output[:5])
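To see why the kernel's mask matters, the block-and-mask pattern can be mimicked in plain Python, with no GPU or Triton required. This is an illustrative sketch only; `simulate_add` and `cdiv` are helper names invented here, not part of Triton's API, and each loop iteration stands in for one program instance of the grid.

```python
def cdiv(n: int, b: int) -> int:
    # Ceiling division, equivalent to triton.cdiv(n, b)
    return (n + b - 1) // b

def simulate_add(x, y, block_size=4):
    """Mimic add_kernel's blockwise masked load/compute/store on Python lists."""
    n = len(x)
    out = [0.0] * n
    for pid in range(cdiv(n, block_size)):        # one iteration per "program"
        block_start = pid * block_size
        for i in range(block_start, block_start + block_size):
            if i < n:                             # the mask: skip out-of-bounds lanes
                out[i] = x[i] + y[i]
    return out

# n=5 is not a multiple of block_size=4, so the mask guards the final block
print(simulate_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
# → [11, 22, 33, 44, 55]
```

Without the `i < n` check, the last block would read past the end of the inputs; `tl.load(..., mask=mask)` and `tl.store(..., mask=mask)` play exactly that guarding role on the GPU.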
