{"id":588,"library":"triton","title":"Triton","description":"Triton is a language and compiler for writing highly efficient custom Deep Learning operations. It provides a Python-based programming environment for writing custom GPU kernels that can achieve performance on par with hand-tuned CUDA, but with higher productivity and flexibility than other existing DSLs. Triton aims to bridge the gap between high-level deep learning frameworks and low-level GPU programming. The current version is 3.6.0, with frequent releases (multiple major/minor releases per year).","status":"active","version":"3.6.0","language":"python","source_language":"en","source_url":"https://github.com/triton-lang/triton/","tags":["deep learning","GPU programming","compiler","custom kernels","JIT","CUDA","HIP","performance"],"install":[{"cmd":"pip install triton","lang":"bash","label":"Stable release"}],"dependencies":[],"imports":[{"symbol":"triton","correct":"import triton"},{"symbol":"triton.language as tl","correct":"import triton.language as tl"},{"symbol":"triton.jit","correct":"@triton.jit"}],"quickstart":{"code":"import triton\nimport triton.language as tl\nimport torch\n\n@triton.jit\ndef add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):\n    # Map program_id to a block of elements\n    pid = tl.program_id(axis=0)\n    block_start = pid * BLOCK_SIZE\n    offsets = block_start + tl.arange(0, BLOCK_SIZE)\n\n    # Create a mask to handle out-of-bounds accesses\n    mask = offsets < n_elements\n\n    # Load data from memory\n    x = tl.load(x_ptr + offsets, mask=mask)\n    y = tl.load(y_ptr + offsets, mask=mask)\n\n    # Perform addition\n    output = x + y\n\n    # Write back to memory\n    tl.store(output_ptr + offsets, output, mask=mask)\n\n\ndef add(x: torch.Tensor, y: torch.Tensor):\n    output = torch.empty_like(x)\n    assert x.is_cuda and y.is_cuda and output.is_cuda\n    n_elements = x.numel()\n\n    # The block size is a compile-time constant, so we can't 
use `n_elements`\n    # directly. Instead, we use a heuristic to choose a good block size.\n    BLOCK_SIZE = 1024 # Or adjust based on your needs\n\n    # Number of programs (blocks) to launch\n    grid = lambda META: (triton.cdiv(n_elements, META['BLOCK_SIZE']),)\n\n    # Launch the kernel\n    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=BLOCK_SIZE)\n    return output\n\nif __name__ == \"__main__\":\n    # Example usage with PyTorch tensors\n    size = 4096\n    x = torch.randn(size, device='cuda')\n    y = torch.randn(size, device='cuda')\n\n    output_triton = add(x, y)\n    output_torch = x + y\n\n    print(f\"Triton output matches PyTorch: {torch.allclose(output_triton, output_torch)}\")\n","lang":"python","description":"This quickstart demonstrates a basic vector addition kernel using Triton. It shows how to define a JIT-compiled kernel with `@triton.jit`, load and store data using `triton.language` primitives like `tl.load` and `tl.store`, and launch the kernel from Python with a specified grid size. This example processes elements in blocks, illustrating Triton's approach to GPU parallelism."},"warnings":[{"fix":"Upgrade your Python version to 3.10 or newer (but less than 3.15).","message":"Triton 3.4.0 raised the minimum required Python version to 3.10, dropping support for Python 3.9 and older; it supports up to Python 3.14 (i.e., <3.15). Ensure your Python environment meets these requirements.","severity":"breaking","affected_versions":">=3.4.0"},{"fix":"Pre-compute values outside the JIT-compiled kernel or use `triton.language` math functions where available. For example, assign `log2e: tl.constexpr = 1.4426950408889634` instead of `log2e: tl.constexpr = math.log2(math.e)`.","message":"In Triton 3.0.0, the behavior of `tl.constexpr` changed. You can no longer directly call non-Triton functions (e.g., `math.log2`) within a JIT function and assign their results to `tl.constexpr` variables. 
These values must be pre-computed outside the kernel or implemented with `triton.language` equivalents.","severity":"breaking","affected_versions":">=3.0.0"},{"fix":"Ensure you are running on a supported Linux environment with a compatible NVIDIA GPU and the latest drivers. For Windows, use WSL2. Verify your GPU's compute capability if experiencing issues on older hardware.","message":"Triton primarily supports Linux with NVIDIA GPUs (Compute Capability 7.0 or higher, Volta generation or newer). AMD GPU support is in development. Official Windows and macOS binaries are not provided; WSL2 is the recommended workaround for Windows. An up-to-date NVIDIA driver is critical for PTX JIT compilation. Support for NVIDIA GPUs with Turing architecture (sm75, e.g., GTX 16xx/RTX 20xx) was dropped starting from Triton 3.3.","severity":"gotcha","affected_versions":"All versions, specifically >=3.3.0 for Turing drop"},{"fix":"If targeting NVIDIA GB200/GB300 GPUs, ensure you are using Triton 3.5.1 or a later version.","message":"Triton 3.5.0 introduced a bug that broke `sm103` (NVIDIA GB200/GB300) support. This was quickly patched in the 3.5.1 bug fix release.","severity":"gotcha","affected_versions":"3.5.0"},{"fix":"Use a supported GPU for `fp8` operations or consider using other data types. Some community forks or specific `triton-windows` builds might offer extended `fp8` support on older hardware.","message":"The official Triton library currently restricts `fp8` (float8) data type support to NVIDIA GPUs with compute capability >= 8.9 (e.g., RTX 40xx and newer). It is not officially supported on Ampere (RTX 30xx) or older architectures.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Monitor the `~/.triton` directory for unexpected files. For specific control over some aspects, environment variables like `TRITON_HOME` can change the root of the cache directory. 
Consider contributing to add more granular control over cache locations.","message":"Triton stores cache files in `~/.triton` by default. This can lead to conflicts or unexpected behavior when using different versions or forks of Triton, or when building self-contained applications. There are currently no official environment variables to override all cache-related directories.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install PyTorch in your environment using `pip install torch` (or specific instructions from pytorch.org based on your hardware and CUDA requirements).","message":"The 'torch' module is a required dependency for Triton. This error indicates that PyTorch is not installed in the environment.","severity":"breaking","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-05-12T16:22:16.601Z","next_check":"2026-06-26T00:00:00.000Z","problems":[{"fix":"pip install triton","cause":"The Triton library is not installed in the current Python environment or is not accessible via the Python path.","error":"ModuleNotFoundError: No module named 'triton'"},{"fix":"Convert `i64` or `f64` inputs to `i32` or `f32` within the kernel using `tl.cast` where possible, or ensure your specific GPU and Triton version support the desired type.","cause":"Triton kernels have limited support for `i64` or `f64` types, often preferring `i32` or `f32` for performance and compatibility across different hardware architectures.","error":"triton.compiler.code_generator.CompilationError: unsupported scalar type: i64"},{"fix":"Adjust the kernel configuration parameters (e.g., reduce `BLOCK_SIZE`, `NUM_WARPS`, or `SHARED_MEMORY_SIZE`) or provide a smaller, more feasible range for the auto-tuner to explore.","cause":"The chosen kernel configuration (e.g., block size, number of warps, shared memory usage) exceeds the available resources or limits of the GPU, or the auto-tuner couldn't find a valid launch 
configuration.","error":"triton.runtime.autotune.OutOfResources: Failed to launch with config"},{"fix":"pip install --upgrade setuptools","cause":"This error typically occurs during Triton installation due to an incompatibility between Triton's build system and an outdated `setuptools` version.","error":"AttributeError: 'NinjaBuildExtension' object has no attribute 'get_ext_filename'"}],"ecosystem":"pypi","meta_description":null,"install_score":50,"install_tag":"draft","quickstart_score":0,"quickstart_tag":"stale","pypi_latest":"3.7.0","install_checks":{"last_tested":"2026-05-12","tag":"draft","tag_description":"notable install failures or slow imports","results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":6.8,"import_time_s":0.33,"mem_mb":10.5,"disk_size":"716M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.25,"mem_mb":10.2,"disk_size":"658M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine 
(musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":6.5,"import_time_s":0.68,"mem_mb":11.5,"disk_size":"720M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.49,"mem_mb":11.3,"disk_size":"661M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":6.5,"import_time_s":0.45,"mem_mb":11.4,"disk_size":"711M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.39,"mem_mb":11.2,"disk_size":"653M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine 
(musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":6.2,"import_time_s":0.41,"mem_mb":11,"disk_size":"711M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.34,"mem_mb":10.7,"disk_size":"653M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":" $EXIT -eq 0 ","exit_code":1,"wheel_type":null,"failure_reason":"build_error","install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"default","exit_code":1,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":null,"mem_mb":null,"disk_size":null},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":" $EXIT -eq 0 ","exit_code":0,"wheel_type":"wheel","failure_reason":null,"install_time_s":6,"import_time_s":0.29,"mem_mb":8.6,"disk_size":"558M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"default","exit_code":0,"wheel_type":null,"failure_reason":null,"install_time_s":null,"import_time_s":0.3,"mem_mb":8.6,"disk_size":"558M"}]},"quickstart_checks":{"last_tested":"2026-04-23","tag":"stale","tag_description":"widespread failures or data too old to 
trust","results":[{"runtime":"python:3.10-alpine","exit_code":1},{"runtime":"python:3.10-slim","exit_code":-1},{"runtime":"python:3.11-alpine","exit_code":1},{"runtime":"python:3.11-slim","exit_code":-1},{"runtime":"python:3.12-alpine","exit_code":1},{"runtime":"python:3.12-slim","exit_code":1},{"runtime":"python:3.13-alpine","exit_code":1},{"runtime":"python:3.13-slim","exit_code":-1},{"runtime":"python:3.9-alpine","exit_code":1},{"runtime":"python:3.9-slim","exit_code":1}]}}