TileLang - High-Performance Kernel Development DSL
TileLang (tile-lang) is a concise domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels, such as GEMM, Dequant GEMM, and FlashAttention. It provides a Pythonic syntax with an underlying compiler infrastructure built on Apache TVM, allowing developers to focus on productivity while achieving state-of-the-art performance. The library is actively developed, with frequent updates and nightly builds, currently at version 0.1.8.
Common errors
- RuntimeError: Auto-tuning failed: No configuration successfully compiled and passed benchmarking/validation.
  Cause: The autotuner could not find a kernel configuration that compiled and passed internal validation for the given parameters and hardware.
  Fix: Check for a correct `nvidia-cuda-nvcc` installation (version `>=13.0` is crucial for CUDA). Simplify the kernel or narrow the autotuner search-space parameters. Consult the logs for the specific compilation errors raised during auto-tuning.
- AttributeError: '_NestedLoopCheckVisitor' object has no attribute '_inst'
  Cause: An internal compiler bug in intermediate-representation processing, triggered when checking nested loop structures.
  Fix: Report the specific code leading to this error on the TileLang GitHub issues page, and ensure you are on the latest patch version of TileLang.
- ImportError on `from tilelang import T` (or similar attempts to import `T` directly)
  Cause: The `T` alias for `tilelang.language` is a convention; `T` is not exposed by the `tilelang` top-level package for direct import.
  Fix: Always import `tilelang.language` and assign the alias manually: `import tilelang.language as T`.
- AssertionError: Expected cuda_home to be found, which may lead to compilation bugs when utilize tilelang backend.
  Cause: TileLang's environment setup could not locate the CUDA installation, often due to an incompatible `nvidia-cuda-nvcc` PyPI package or incorrect environment variables.
  Fix: Ensure `nvidia-cuda-nvcc>=13.0` is installed from PyPI, or verify that `CUDA_HOME` points to the correct CUDA toolkit installation directory if not using the PyPI package.
Warnings
- breaking The `tilelang.lower` API will be replaced by `tilelang.compile` in version 0.2.0. Existing code using `lower` will break.
- gotcha Auto-tuning can sometimes fail with 'RuntimeError: Auto-tuning failed: No configuration successfully compiled and passed benchmarking/validation.' This indicates that none of the explored configurations could be successfully compiled or validated on the target hardware.
- gotcha Inconsistent CUDA kernel generation has been reported, potentially leading to correctness failures in production despite passing tests. This suggests non-deterministic compilation behavior in certain complex scenarios.
- gotcha Layout inference for shared buffers in GEMM operations with different transpose modes can fail.
- deprecated The `primitives` folder and its design are being phased out, with functionalities merged into the `tileop` module. Direct imports or usage of `primitives` may become unstable or removed.
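Given the announced `tilelang.lower` to `tilelang.compile` breaking change, code that must run across the 0.2.0 boundary can probe for the new API at runtime. The helper below is a hypothetical shim, not part of TileLang itself; it assumes only the two entry-point names stated in the warning above.

```python
# Hypothetical forward-compatibility shim for the lower -> compile transition.
# Guarded import so the sketch also loads where TileLang is not installed.
try:
    import tilelang
except ImportError:
    tilelang = None

def build(prim_func, **kwargs):
    """Compile a TileLang prim_func, preferring the post-0.2.0 API."""
    if tilelang is None:
        raise RuntimeError("tilelang is not installed")
    if hasattr(tilelang, "compile"):
        return tilelang.compile(prim_func, **kwargs)  # new API
    return tilelang.lower(prim_func, **kwargs)        # deprecated pre-0.2.0 API
```

This keeps the deprecation warning actionable without pinning callers to either version.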
Install
-
pip install tilelang
Imports
- tilelang
import tilelang
- tilelang.language
import tilelang.language as T
- tilelang.jit
from tilelang import jit
@tilelang.jit
- T.prim_func
import tilelang.language as T
@T.prim_func
Quickstart
import tilelang
import tilelang.language as T
import torch
@tilelang.jit
def matmul(M, N, K, block_M, block_N, block_K, dtype=T.float16, accum_dtype=T.float32, out_dtype=T.float32):
    @T.prim_func
    def main(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), out_dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=0):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])
    return main
M = 1024
N = 1024
K = 1024
block_M = 128
block_N = 128
block_K = 64
# 1. Define the kernel (matmul) and compile/lower it into an executable module
matmul_kernel = matmul(M, N, K, block_M, block_N, block_K)
# 2. Test the kernel in Python with PyTorch data
a = torch.randn(M, K, device="cuda", dtype=torch.float16)
b = torch.randn(K, N, device="cuda", dtype=torch.float16)
c = torch.empty(M, N, device="cuda", dtype=torch.float32)  # matches the kernel's out_dtype=T.float32
# Run the kernel
matmul_kernel(a, b, c)
# Reference multiplication using PyTorch
ref_c = (a @ b).to(c.dtype)
# Validate correctness
torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
print("Kernel output matches PyTorch reference.")
# (Optional) Profile latency with kernel
# profiler = matmul_kernel.get_profiler(tensor_supply_type=tilelang.TensorSupplyType.Normal)
# latency = profiler.do_bench()
# print(f"Latency: {latency} ms")
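The tiling the quickstart kernel expresses can be made concrete with a plain-NumPy sketch: each `(by, bx)` pair plays the role of one `T.Kernel` block, and the `ko` loop mirrors the `T.Pipelined` reduction over K tiles. This is an illustration of the blocked GEMM schedule only, not TileLang code; the block-size names mirror the quickstart.

```python
import numpy as np

def ceildiv(a, b):
    # Same grid-size computation as T.ceildiv in the kernel
    return (a + b - 1) // b

def blocked_matmul(A, B, block_M=128, block_N=128, block_K=64):
    """NumPy sketch of the tiled GEMM the TileLang quickstart performs."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)
    for by in range(ceildiv(M, block_M)):          # kernel block rows
        for bx in range(ceildiv(N, block_N)):      # kernel block columns
            # Accumulator tile, analogous to C_local (accum_dtype=float32)
            acc = np.zeros_like(C[by*block_M:(by+1)*block_M,
                                  bx*block_N:(bx+1)*block_N])
            for ko in range(ceildiv(K, block_K)):  # reduction over K tiles
                a_tile = A[by*block_M:(by+1)*block_M, ko*block_K:(ko+1)*block_K]
                b_tile = B[ko*block_K:(ko+1)*block_K, bx*block_N:(bx+1)*block_N]
                acc += a_tile.astype(np.float32) @ b_tile.astype(np.float32)
            C[by*block_M:(by+1)*block_M, bx*block_N:(bx+1)*block_N] = acc
    return C
```

Running this with small block sizes against `A @ B` shows the tiled loop nest computes the same product; on the GPU, TileLang maps the two outer loops to thread blocks and stages the tiles through shared memory.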