{"id":8043,"library":"cuda-tile","title":"cuTile Python","description":"cuTile Python is an NVIDIA-developed Python-based Domain-Specific Language (DSL) that implements the CUDA Tile programming model. It simplifies the development of high-performance GPU kernels by abstracting away low-level thread management, allowing developers to focus on tile-based operations. The library leverages advanced hardware capabilities like Tensor Cores and Tensor Memory Accelerators, providing portability across NVIDIA GPU architectures. It is actively maintained by NVIDIA, with its current version at 1.2.0, and receives regular updates aligned with CUDA Toolkit releases.","status":"active","version":"1.2.0","language":"en","source_language":"en","source_url":"https://github.com/nvidia/cutile-python","tags":["cuda","gpu","nvidia","parallel computing","deep learning","high performance computing","compiler","dsl","machine learning"],"install":[{"cmd":"pip install cuda-tile","lang":"bash","label":"Standard installation (with system-wide CUDA Toolkit 13.1+)"},{"cmd":"pip install cuda-tile[tileiras]","lang":"bash","label":"Includes CUDA TileIR compiler dependencies if CUDA Toolkit is not system-wide"}],"dependencies":[{"reason":"Required for compilation and runtime; can be installed system-wide or via `[tileiras]` extra.","package":"CUDA Toolkit 13.1+","optional":false},{"reason":"Recommended for array operations and quickstart examples.","package":"cupy-cuda13x","optional":true},{"reason":"Used for host-side array verification in examples.","package":"numpy","optional":true},{"reason":"System driver requirement for GPU execution.","package":"NVIDIA Driver r580+","optional":false}],"imports":[{"symbol":"ct","correct":"import cuda.tile as ct"}],"quickstart":{"code":"import cuda.tile as ct\nimport cupy\nimport numpy as np\n\nTILE_SIZE = 16\n\n@ct.kernel\ndef vector_add_kernel(a, b, result):\n    block_id = ct.bid(0)\n    a_tile = ct.load(a, index=(block_id,), shape=(TILE_SIZE,))\n    
b_tile = ct.load(b, index=(block_id,), shape=(TILE_SIZE,))\n    result_tile = a_tile + b_tile\n    ct.store(result, index=(block_id,), tile=result_tile)\n\n# Generate input arrays on GPU using CuPy\n# Ensure cupy-cuda13x is installed via `pip install cupy-cuda13x`\n# and CUDA Toolkit 13.1+ is available (system-wide or via [tileiras] install)\n\nif cupy.cuda.is_available():\n    rng = cupy.random.default_rng()\n    a_gpu = rng.random(128, dtype=cupy.float32)\n    b_gpu = rng.random(128, dtype=cupy.float32)\n    expected_np = cupy.asnumpy(a_gpu) + cupy.asnumpy(b_gpu)\n\n    # Allocate an output array on GPU\n    result_gpu = cupy.zeros_like(a_gpu)\n\n    # Launch the kernel\n    grid = (ct.cdiv(a_gpu.shape[0], TILE_SIZE), 1, 1)\n    ct.launch(cupy.cuda.get_current_stream(), grid, vector_add_kernel, (a_gpu, b_gpu, result_gpu))\n\n    # Verify the results\n    result_np = cupy.asnumpy(result_gpu)\n    np.testing.assert_array_almost_equal(result_np, expected_np)\n    print(\"Vector addition successful!\")\nelse:\n    print(\"CUDA is not available. Cannot run CuPy example.\")","lang":"python","description":"This example demonstrates how to define and launch a simple vector addition kernel using `cuda-tile` with CuPy. It showcases loading tiles from global memory, performing operations on them, and storing the result back. This pattern is fundamental to cuTile kernel development."},"warnings":[{"fix":"Migrate to the new element type wrappers (Int8, Int32, Float16, Float32, etc.) directly defined in `cuda_tile_ops.py` for element type handling. 
Rewrite any logic relying on the removed utilities.","message":"The `StringType` (`cuda_tile.string`) and its associated bytecode support, along with `cuda_tile_utils.py` (containing `mutex_synchronize` and `printf_sync_tile`), were removed.","severity":"breaking","affected_versions":"Early 1.x releases of `cuda-tile` (between pre-release builds and 1.0.0, or between 1.0.0 and 1.1.0)."},{"fix":"Update your NVIDIA GPU driver to version r580 or newer. Check `nvidia-smi` for your current driver version.","message":"cuTile Python requires NVIDIA Driver r580 or later. With older drivers, cuTile kernels may fail to execute or may not execute correctly.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure CUDA Toolkit 13.1+ is installed system-wide, or install `cuda-tile` with `pip install cuda-tile[tileiras]`, which installs the necessary compiler components into your Python environment. On Debian-based systems, `apt-get install cuda-tileiras-13.1 cuda-compiler-13.1` can be used instead of the full toolkit.","message":"CUDA Toolkit 13.1+ is a mandatory prerequisite. If it is not installed system-wide, the `cuda-tile[tileiras]` installation option must be used. Failing to meet this requirement results in compilation or runtime errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Upgrade to CUDA Toolkit 13.2+ and the latest `cuda-tile` package (1.2.0+), which extends full support to the Ampere and Ada Lovelace architectures.","message":"The initial CUDA Toolkit 13.1 release (and the corresponding `cuda-tile` versions) shipped a `tileiras` compiler that supported only Blackwell GPUs. This restriction is being lifted in later versions, but users on 13.1 with older architectures (Ampere, Ada Lovelace, Hopper) may see limited or no functionality and reduced performance.","severity":"gotcha","affected_versions":"Versions of `cuda-tile` compiled with CUDA Toolkit 13.1. 
Compute capability 8.x, 9.x GPUs."},{"fix":"Ensure all tile dimensions (e.g., in `shape` arguments to `ct.load`) are literal integer powers of two (e.g., 16, 32, 64).","message":"Tile dimensions in cuTile kernels must be compile-time constants and powers of two for optimal hardware mapping. Dynamic or non-power-of-two tile sizes can raise `TileValueError` or `TileUnsupportedFeatureError`, or cause suboptimal performance.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Run `pip install cuda-tile` or `pip install cuda-tile[tileiras]` to install the package. If using a virtual environment, ensure it is activated.","cause":"The `cuda-tile` Python package is not installed, or the Python environment is not correctly activated.","error":"ModuleNotFoundError: No module named 'cuda.tile'"},{"fix":"Verify that CUDA Toolkit 13.1+ is correctly installed and that its `bin` directory (containing `tileiras`) is on your system's PATH. If using `cuda-tile[tileiras]`, ensure package versions are consistent. Check that the GPU driver is r580+. Set `CUDA_TILE_ENABLE_CRASH_DUMP=1` for detailed logs.","cause":"The underlying `tileiras` compiler (part of the CUDA Toolkit) encountered an error during kernel compilation, or it was not found on the PATH. This often indicates a problem with the CUDA Toolkit installation or an unsupported GPU/driver combination.","error":"cuda.tile.TileCompilerExecutionError: TileIR compiler 'tileiras' failed to compile kernel."},{"fix":"Review the types of all variables and parameters within your kernel. Ensure they are compatible with cuTile's type system (e.g., basic numeric types, or CuPy/PyTorch arrays for host-side arguments). 
If type annotations are used, ensure they are correct.","cause":"A Python variable or expression within a `ct.kernel`-decorated function used a type or data type unsupported for GPU operations, or an explicit type annotation does not match its usage.","error":"cuda.tile.TileTypeError: Unexpected type or data type in kernel."},{"fix":"Ensure an NVIDIA GPU is present and functioning. Install or update the NVIDIA driver to r580+. Reinstall `cupy` (e.g., `pip install cupy-cuda13x`) to match your CUDA Toolkit version. Confirm CUDA Toolkit 13.1+ is installed and accessible.","cause":"The system lacks a compatible NVIDIA GPU, the CUDA driver is not correctly installed or loaded, or the `cupy` installation does not match the available CUDA version.","error":"RuntimeError: CUDA error: no CUDA-capable device is detected"}]}