{"id":6477,"library":"triton-windows","title":"Triton for Windows","description":"Triton-windows is a community-maintained fork of the Triton language and compiler, specifically tailored to support Deep Learning operations on Windows. It provides a highly optimized solution for defining and launching custom GPU kernels, enabling high-performance computing in Python environments on Windows machines. The library's current version is 3.6.0.post26, with releases closely following the upstream Triton project, often including Windows-specific bug fixes and performance enhancements.","status":"active","version":"3.6.0.post26","language":"en","source_language":"en","source_url":"https://github.com/woct0rdho/triton-windows","tags":["deep learning","gpu","cuda","pytorch","windows","compiler","kernel","performance"],"install":[{"cmd":"pip install -U \"triton-windows<3.7\"","lang":"bash","label":"Install latest Triton for Windows"}],"dependencies":[{"reason":"Triton kernels interact directly with PyTorch tensors and its CUDA backend. Specific PyTorch versions are required for each Triton version.","package":"torch","optional":false}],"imports":[{"symbol":"triton","correct":"import triton"},{"symbol":"triton.language","correct":"import triton.language as tl"},{"symbol":"triton.jit","correct":"from triton import jit"}],"quickstart":{"code":"import torch\nimport triton\nimport triton.language as tl\n\n# Define a simple Triton kernel for vector addition\n@triton.jit\ndef add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):\n    pid = tl.program_id(axis=0)\n    block_start = pid * BLOCK_SIZE\n    offsets = block_start + tl.arange(0, BLOCK_SIZE)\n    mask = offsets < n_elements\n    x = tl.load(x_ptr + offsets, mask=mask)\n    y = tl.load(y_ptr + offsets, mask=mask)\n    output = x + y\n    tl.store(output_ptr + offsets, output, mask=mask)\n\ndef add(x: torch.Tensor, y: torch.Tensor):\n    # Ensure inputs are on a CUDA device and have matching shapes\n    assert x.is_cuda and y.is_cuda, \"Inputs must be on a CUDA device\"\n    assert x.shape == y.shape, \"Input shapes must match\"\n    n_elements = x.numel()\n\n    # Allocate output tensor\n    output = torch.empty_like(x)\n\n    # Calculate grid dimension based on BLOCK_SIZE\n    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)\n\n    # Launch the kernel\n    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)\n    return output\n\nif __name__ == \"__main__\":\n    if not torch.cuda.is_available():\n        print(\"CUDA not available. Triton requires a GPU.\")\n    else:\n        print(\"CUDA is available, running Triton example...\")\n        size = 2**20  # 1,048,576 elements (~1 million)\n        x = torch.rand(size, device='cuda')\n        y = torch.rand(size, device='cuda')\n        output = add(x, y)\n\n        # Verify correctness against the PyTorch reference\n        expected_output = x + y\n        assert torch.allclose(output, expected_output, atol=1e-5), \"Triton output mismatch!\"\n        print(\"Triton vector addition successful!\")\n        print(\"First 5 elements of Triton output:\", output[:5])\n        print(\"First 5 elements of PyTorch output:\", expected_output[:5])","lang":"python","description":"This quickstart demonstrates how to define and launch a simple vector addition kernel using Triton. It highlights the use of `triton.jit` for kernel definition, `triton.language` for GPU operations, and integration with PyTorch tensors. Ensure you have a CUDA-enabled GPU and PyTorch installed."},"warnings":[{"fix":"Always check the release notes for your desired Triton-windows version to identify the compatible PyTorch version. Ensure your PyTorch installation meets or exceeds this requirement.","message":"Each major version of Triton-windows has strict compatibility requirements with specific PyTorch versions. For example, Triton 3.6 requires PyTorch >= 2.10, Triton 3.5 requires PyTorch >= 2.9, and Triton 3.4 requires PyTorch >= 2.8. Installing a mismatched version will lead to runtime errors or incorrect behavior.","severity":"breaking","affected_versions":"All versions"},{"fix":"Use a version range in your `pip install` command, e.g., `pip install -U \"triton-windows<3.7\"` for current Triton 3.6, or `\"triton-windows<3.6\"` for Triton 3.5, etc.","message":"To prevent automatic updates of `triton-windows` from breaking compatibility with your installed PyTorch (due to the strict versioning explained above), it's highly recommended to pin the `triton-windows` version during installation.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Upgrade to `triton-windows 3.5.1-windows.post24` or later, which includes a fix to shorten cache temp paths. If upgrading isn't possible, try setting the `TMP` or `TEMP` environment variables to a short path, e.g., `C:\\Temp`.","message":"Windows' path length limit (260 characters) can cause issues with Triton's cache directory, leading to compilation failures or 'file not found' errors. This was a common problem in older versions.","severity":"gotcha","affected_versions":"<3.5.1-windows.post24"},{"fix":"Ensure you are on the latest `triton-windows` version (3.6.0.post26 or newer) for the most stable AMD GPU support. Report any specific issues on the project's GitHub repository.","message":"While initial support for AMD GPUs (with TheRock) was introduced in `3.5.1-windows.post23`, and further fixes landed in `3.6.0-windows.post25`, AMD GPU support is still evolving. Users might encounter specific bugs or limitations not present on NVIDIA GPUs.","severity":"gotcha","affected_versions":"All versions with AMD GPU support"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z"}