cuTile Python
cuTile Python is an NVIDIA-developed Python-based Domain-Specific Language (DSL) that implements the CUDA Tile programming model. It simplifies the development of high-performance GPU kernels by abstracting away low-level thread management, allowing developers to focus on tile-based operations. The library leverages advanced hardware capabilities like Tensor Cores and Tensor Memory Accelerators, providing portability across NVIDIA GPU architectures. It is actively maintained by NVIDIA, with its current version at 1.2.0, and receives regular updates aligned with CUDA Toolkit releases.
Common errors
- ModuleNotFoundError: No module named 'cuda.tile'
  Cause: The `cuda-tile` Python package is not installed, or the Python environment is not correctly activated.
  Fix: Run `pip install cuda-tile` or `pip install cuda-tile[tileiras]` to install the package. If using a virtual environment, ensure it is activated.
- cuda.tile.TileCompilerExecutionError: TileIR compiler 'tileiras' failed to compile kernel.
  Cause: The underlying `tileiras` compiler (part of the CUDA Toolkit) encountered an error during kernel compilation, or it was not found on the PATH. This often indicates a problem with the CUDA Toolkit installation or an unsupported GPU/driver combination.
  Fix: Verify that CUDA Toolkit 13.1+ is correctly installed and that its `bin` directory (containing `tileiras`) is on your system's PATH. If using `cuda-tile[tileiras]`, ensure package versions are consistent. Check that the GPU driver is r580 or later. Set `CUDA_TILE_ENABLE_CRASH_DUMP=1` for detailed logs.
- cuda.tile.TileTypeError: Unexpected type or data type in kernel.
  Cause: A Python variable or expression inside a `ct.kernel`-decorated function uses a type or data type unsupported for GPU operations, or an explicit type annotation does not match its usage.
  Fix: Review the types of all variables and parameters in the kernel. Ensure they are compatible with cuTile's type system (e.g., basic numeric types, or CuPy/PyTorch arrays for host-side arguments). If type annotations are used, ensure they are correct.
- RuntimeError: CUDA error: no CUDA-capable device is detected
  Cause: The system lacks a compatible NVIDIA GPU, the CUDA drivers are not correctly installed or loaded, or the `cupy` installation does not match the available CUDA version.
  Fix: Ensure an NVIDIA GPU is present and functioning. Install or update NVIDIA drivers to r580 or later. Reinstall `cupy` (e.g., `pip install cupy-cuda13x`) to match your CUDA Toolkit version. Confirm CUDA Toolkit 13.1+ is installed and accessible.
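Several of the fixes above amount to environment checks. They can be sketched as a host-side diagnostic in pure Python; the package and binary names (`cuda.tile`, `tileiras`, `cupy`) come from the errors above, while the function itself is a hypothetical helper, not an official cuTile tool:

```python
import importlib.util
import shutil


def _spec_exists(name):
    """True if the module can be located without importing GPU code."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `cuda`) is missing entirely.
        return False


def diagnose_cutile_env():
    """Report common causes of the errors above: missing packages
    or a `tileiras` compiler that is not on the PATH."""
    return {
        # ModuleNotFoundError: is the cuda-tile package importable?
        "cuda.tile installed": _spec_exists("cuda.tile"),
        # TileCompilerExecutionError: is the tileiras compiler on PATH?
        "tileiras on PATH": shutil.which("tileiras") is not None,
        # RuntimeError (no CUDA device): is a CuPy build installed at all?
        "cupy installed": _spec_exists("cupy"),
    }


for check, ok in diagnose_cutile_env().items():
    print(f"{check}: {'OK' if ok else 'MISSING'}")
```

This only detects missing components; driver-version and device checks still require a working CUDA installation.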
Warnings
- breaking The `StringType` (cuda_tile.string) and associated bytecode support, along with `cuda_tile_utils.py` (containing `mutex_synchronize` and `printf_sync_tile`), were removed.
- gotcha cuTile Python requires NVIDIA Driver r580 or later to run. Older drivers will prevent cuTile kernels from executing correctly or at all.
- gotcha The CUDA Toolkit 13.1+ is a mandatory prerequisite. If not installed system-wide, the `cuda-tile[tileiras]` installation option must be used. Failure to meet this requirement will result in compilation or runtime errors.
- gotcha In the initial CUDA Toolkit 13.1 release (and the corresponding `cuda-tile` versions), the `tileiras` compiler supported only Blackwell GPUs. This restriction is being lifted in later versions, but users on 13.1 with older architectures (Ampere, Ada Lovelace, Hopper) may see limited functionality or performance, or none at all.
- gotcha Tile dimensions in cuTile kernels must be compile-time constants and powers of two for optimal hardware mapping. Dynamic or non-power-of-two tile sizes can lead to `TileValueError` or `TileUnsupportedFeatureError` or suboptimal performance.
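The last gotcha can be guarded against on the host before any kernel is compiled. A minimal pre-check, assuming only the constraint stated above (tile dimensions must be compile-time-constant powers of two); the helper name is hypothetical:

```python
def validate_tile_shape(shape):
    """Raise early if any tile dimension is not a positive power of two,
    mirroring the compile-time constraint described above."""
    for dim in shape:
        # A positive int n is a power of two iff n & (n - 1) == 0.
        if not (isinstance(dim, int) and dim > 0 and (dim & (dim - 1)) == 0):
            raise ValueError(
                f"tile dimension {dim!r} must be a positive power-of-two int"
            )
    return tuple(shape)


validate_tile_shape((16, 64))    # passes
# validate_tile_shape((16, 48))  # raises ValueError: 48 is not a power of two
```

Failing fast on the host gives a clearer error than a `TileValueError` or `TileUnsupportedFeatureError` surfacing later from the compiler.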
Install
- pip install cuda-tile
- pip install cuda-tile[tileiras]
Imports
- import cuda.tile as ct
Quickstart
import cuda.tile as ct
import cupy
import numpy as np

TILE_SIZE = 16

@ct.kernel
def vector_add_kernel(a, b, result):
    block_id = ct.bid(0)
    a_tile = ct.load(a, index=(block_id,), shape=(TILE_SIZE,))
    b_tile = ct.load(b, index=(block_id,), shape=(TILE_SIZE,))
    result_tile = a_tile + b_tile
    ct.store(result, index=(block_id,), tile=result_tile)

# Generate input arrays on the GPU using CuPy.
# Requires cupy-cuda13x (`pip install cupy-cuda13x`) and CUDA Toolkit 13.1+
# (system-wide or via the [tileiras] install option).
if cupy.cuda.is_available():
    rng = cupy.random.default_rng()
    a_gpu = rng.random(128, dtype=cupy.float32)
    b_gpu = rng.random(128, dtype=cupy.float32)
    expected_np = cupy.asnumpy(a_gpu) + cupy.asnumpy(b_gpu)

    # Allocate an output array on the GPU
    result_gpu = cupy.zeros_like(a_gpu)

    # Launch the kernel: one block per TILE_SIZE-element tile
    grid = (ct.cdiv(a_gpu.shape[0], TILE_SIZE), 1, 1)
    ct.launch(cupy.cuda.get_current_stream(), grid, vector_add_kernel, (a_gpu, b_gpu, result_gpu))

    # Verify the results against a NumPy reference
    result_np = cupy.asnumpy(result_gpu)
    np.testing.assert_array_almost_equal(result_np, expected_np)
    print("Vector addition successful!")
else:
    print("CUDA is not available. Cannot run CuPy example.")
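The grid size in the launch above is a ceiling division of the array length by the tile size, which `ct.cdiv` computes. The same arithmetic in plain Python (a sketch of the calculation, not cuTile's implementation) shows why 128 elements map to 8 blocks:

```python
def cdiv(a, b):
    """Ceiling division: how many b-element tiles are needed to cover a elements."""
    # -(-a // b) equals math.ceil(a / b) for positive integers.
    return -(-a // b)


# 128 elements with TILE_SIZE = 16 -> 8 blocks, exactly covering the array
print(cdiv(128, 16))  # 8
# 130 elements would need one extra, partially filled block
print(cdiv(130, 16))  # 9
```

Whenever the array length is not a multiple of the tile size, the last block covers a partial tile, so kernels must be written (or inputs padded) with that boundary in mind.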