cuTile Python

1.2.0 · active · verified Thu Apr 16

cuTile Python is NVIDIA's Python-based domain-specific language (DSL) for the CUDA Tile programming model. It simplifies writing high-performance GPU kernels by abstracting away low-level thread management, so developers can focus on tile-based operations. It uses hardware features such as Tensor Cores and Tensor Memory Accelerators while remaining portable across NVIDIA GPU architectures. NVIDIA actively maintains it; the current version is 1.2.0, and updates track CUDA Toolkit releases.

Quickstart

This example defines and launches a simple vector addition kernel using `cuda-tile` with CuPy. It loads tiles from global memory, adds them, and stores the result back, a pattern fundamental to cuTile kernel development.

import cuda.tile as ct
import cupy
import numpy as np

TILE_SIZE = 16

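@ct.kernel
def vector_add_kernel(a, b, result):
    # Each block handles one tile; bid(0) is this block's index along grid axis 0
    block_id = ct.bid(0)
    # Load one TILE_SIZE-element tile from each input array
    a_tile = ct.load(a, index=(block_id,), shape=(TILE_SIZE,))
    b_tile = ct.load(b, index=(block_id,), shape=(TILE_SIZE,))
    # Add the two tiles elementwise
    result_tile = a_tile + b_tile
    # Write the result tile back to global memory
    ct.store(result, index=(block_id,), tile=result_tile)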

# Prerequisites: CuPy installed via `pip install cupy-cuda13x`, and
# CUDA Toolkit 13.1+ available (system-wide or via the [tileiras] install)

if cupy.cuda.is_available():
    # Generate input arrays on the GPU using CuPy
    rng = cupy.random.default_rng()
    a_gpu = rng.random(128, dtype=cupy.float32)
    b_gpu = rng.random(128, dtype=cupy.float32)
    expected_np = cupy.asnumpy(a_gpu) + cupy.asnumpy(b_gpu)

    # Allocate an output array on GPU
    result_gpu = cupy.zeros_like(a_gpu)

    # Launch the kernel
    grid = (ct.cdiv(a_gpu.shape[0], TILE_SIZE), 1, 1)
    ct.launch(cupy.cuda.get_current_stream(), grid, vector_add_kernel, (a_gpu, b_gpu, result_gpu))

    # Verify the results
    result_np = cupy.asnumpy(result_gpu)
    np.testing.assert_array_almost_equal(result_np, expected_np)
    print("Vector addition successful!")
else:
    print("CUDA is not available. Cannot run CuPy example.")
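
To make the tile indexing in the kernel easier to follow without a GPU, here is a minimal CPU-only sketch in plain Python. It assumes that a tile index selects the elements `[index * TILE_SIZE, (index + 1) * TILE_SIZE)` of a 1-D array and that the grid launches one block per tile. The helpers `cdiv`, `load_tile`, `store_tile`, and `vector_add_reference` are hypothetical names written for this illustration; they are not part of `cuda.tile` and only approximate its behavior.

```python
# CPU-only sketch of the tile indexing used by the quickstart kernel.
# Assumption: tile index i covers elements [i * TILE_SIZE, (i + 1) * TILE_SIZE).
# All helpers below are hypothetical illustrations, not cuda.tile APIs.

TILE_SIZE = 16


def cdiv(n, d):
    """Ceiling division: number of tiles of size d needed to cover n elements."""
    return (n + d - 1) // d


def load_tile(array, index, size):
    """Return a copy of the tile at tile-space position `index`."""
    start = index * size
    return array[start:start + size]


def store_tile(array, index, tile, size):
    """Write `tile` into `array` at tile-space position `index`."""
    start = index * size
    array[start:start + len(tile)] = tile


def vector_add_reference(a, b):
    """Emulate the grid launch: one loop iteration stands in for one block."""
    result = [0.0] * len(a)
    for block_id in range(cdiv(len(a), TILE_SIZE)):
        a_tile = load_tile(a, block_id, TILE_SIZE)
        b_tile = load_tile(b, block_id, TILE_SIZE)
        summed = [x + y for x, y in zip(a_tile, b_tile)]
        store_tile(result, block_id, summed, TILE_SIZE)
    return result


a = [float(i) for i in range(128)]
b = [2.0 * i for i in range(128)]
out = vector_add_reference(a, b)
assert out == [3.0 * i for i in range(128)]
print(cdiv(len(a), TILE_SIZE))  # 8 tiles cover 128 elements
```

With 128 elements and a tile size of 16, the grid has 8 blocks, matching the `grid` tuple computed in the quickstart. The sketch slices short at the end of the array when the length is not a multiple of the tile size; how cuTile itself treats partial tiles is not covered here.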
