{"id":8714,"library":"tilelang","title":"TileLang - High-Performance Kernel Development DSL","description":"TileLang (tile-lang) is a concise domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels, such as GEMM, Dequant GEMM, and FlashAttention. It provides a Pythonic syntax with an underlying compiler infrastructure built on Apache TVM, allowing developers to focus on productivity while achieving state-of-the-art performance. The library is actively developed, with frequent updates and nightly builds, currently at version 0.1.8.","status":"active","version":"0.1.8","language":"en","source_language":"en","source_url":"https://github.com/tile-ai/tilelang","tags":["AI","GPU","CPU","compiler","DSL","high-performance computing","kernel","TVM"],"install":[{"cmd":"pip install tilelang","lang":"bash","label":"Install with pip"}],"dependencies":[{"reason":"Underlying compiler infrastructure for code generation.","package":"apache-tvm","optional":false},{"reason":"Required for PyTorch integration and tensor operations.","package":"torch-c-dlpack-ext","optional":false},{"reason":"Required for CUDA backend compilation; version >=13.0 recommended.","package":"nvidia-cuda-nvcc","optional":true}],"imports":[{"symbol":"tilelang","correct":"import tilelang"},{"symbol":"tilelang.language","correct":"import tilelang.language as T"},{"note":"Access the decorator through the top-level package as `tilelang.jit` rather than importing `jit` by name.","wrong":"from tilelang import jit","symbol":"tilelang.jit","correct":"@tilelang.jit"},{"note":"Functions like `prim_func` are typically accessed via the `tilelang.language` alias `T`.","wrong":"from tilelang.language import prim_func","symbol":"T.prim_func","correct":"@T.prim_func"}],"quickstart":{"code":"import tilelang\nimport tilelang.language as T\nimport torch\n\n@tilelang.jit\ndef matmul(M, N, K, block_M, block_N, block_K, dtype=T.float16, accum_dtype=T.float32,
out_dtype=T.float16):\n    @T.prim_func\n    def main(\n        A: T.Tensor((M, K), dtype),\n        B: T.Tensor((K, N), dtype),\n        C: T.Tensor((M, N), out_dtype),\n    ):\n        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):\n            A_shared = T.alloc_shared((block_M, block_K), dtype)\n            B_shared = T.alloc_shared((block_K, block_N), dtype)\n            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)\n            T.clear(C_local)\n\n            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):\n                T.copy(A[by * block_M, ko * block_K], A_shared)\n                T.copy(B[ko * block_K, bx * block_N], B_shared)\n                T.gemm(A_shared, B_shared, C_local)\n\n            T.copy(C_local, C[by * block_M, bx * block_N])\n    return main\n\nM = 1024\nN = 1024\nK = 1024\nblock_M = 128\nblock_N = 128\nblock_K = 64\n\n# 1. Instantiate the kernel (matmul) and compile it into an executable module\nmatmul_kernel = matmul(M, N, K, block_M, block_N, block_K)\n\n# 2. Test the kernel in Python with PyTorch data\na = torch.randn(M, K, device=\"cuda\", dtype=torch.float16)\nb = torch.randn(K, N, device=\"cuda\", dtype=torch.float16)\nc = torch.empty(M, N, device=\"cuda\", dtype=torch.float16)\n\n# Run the kernel\nmatmul_kernel(a, b, c)\n\n# Reference multiplication using PyTorch\nref_c = (a @ b).to(c.dtype)\n\n# Validate correctness\ntorch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)\nprint(\"Kernel output matches PyTorch reference.\")\n\n# (Optional) Profile kernel latency\n# profiler = matmul_kernel.get_profiler(tensor_supply_type=tilelang.TensorSupplyType.Normal)\n# latency = profiler.do_bench()\n# print(f\"Latency: {latency} ms\")","lang":"python","description":"This quickstart demonstrates how to define and execute a matrix multiplication (GEMM) kernel using TileLang, integrating with PyTorch for tensor management and validation.
It showcases decorators like `@tilelang.jit` and `@T.prim_func`, memory allocation with `T.alloc_shared`, data movement with `T.copy`, matrix multiplication with `T.gemm`, and loop pipelining with `T.Pipelined`."},"warnings":[{"fix":"Update calls from `tilelang.lower(...)` to `tilelang.compile(...)`.","message":"The `tilelang.lower` API will be replaced by `tilelang.compile` in version 0.2.0. Existing code using `lower` will break.","severity":"breaking","affected_versions":">=0.2.0"},{"fix":"Review the auto-tuning parameters, kernel definition, and target hardware environment. Ensure dependencies like `nvidia-cuda-nvcc` are correctly installed and meet version requirements (e.g., `>=13.0`).","message":"Auto-tuning can sometimes fail with 'RuntimeError: Auto-tuning failed: No configuration successfully compiled and passed benchmarking/validation.' This indicates that none of the explored configurations could be successfully compiled or validated on the target hardware.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Thoroughly validate kernels across various inputs and environments, including production-like setups. Monitor GitHub issues for updates and potential hotfixes related to compiler determinism.","message":"Inconsistent CUDA kernel generation has been reported, potentially leading to correctness failures in production despite passing tests. This suggests non-deterministic compilation behavior in certain complex scenarios.","severity":"gotcha","affected_versions":"All versions (observed in 0.1.x)"},{"fix":"Carefully review and test kernels that use shared buffers across multiple GEMM operations with varying transpose configurations. 
Consider explicit layout annotations if automatic inference proves problematic.","message":"Layout inference for shared buffers in GEMM operations with different transpose modes can fail.","severity":"gotcha","affected_versions":"All versions (observed in 0.1.x)"},{"fix":"Migrate any usage of modules or functions from the `primitives` folder to their equivalents in the `tileop` module. Consult the latest GitHub repository for the correct paths.","message":"The `primitives` folder and its design are being phased out, with functionalities merged into the `tileop` module. Direct imports or usage of `primitives` may become unstable or removed.","severity":"deprecated","affected_versions":">=0.1.8 (gradual removal)"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Check for correct `nvidia-cuda-nvcc` installation (version `>=13.0` is crucial for CUDA). Simplify the kernel or adjust autotuner search space parameters. Consult logs for specific compilation errors during the auto-tuning process.","cause":"The autotuner could not find a valid or performant kernel configuration that successfully compiled and passed internal validation for the given parameters and hardware.","error":"RuntimeError: Auto-tuning failed: No configuration successfully compiled and passed benchmarking/validation."},{"fix":"This indicates a compiler bug. Report the specific code leading to this error on the TileLang GitHub issues page. 
Ensure you are on the latest patch version of TileLang.","cause":"This is an internal bug related to the compiler's intermediate representation processing, particularly when checking nested loop structures.","error":"AttributeError: '_NestedLoopCheckVisitor' object has no attribute '_inst'"},{"fix":"Always import `tilelang.language` and alias it explicitly: `import tilelang.language as T`.","cause":"The `T` alias for `tilelang.language` is a convention; it is not exposed by the top-level `tilelang` package, so `from tilelang import T` fails.","error":"ImportError: cannot import name 'T' from 'tilelang'"},{"fix":"Ensure `nvidia-cuda-nvcc>=13.0` is installed from PyPI. Verify that CUDA_HOME or a similar environment variable points to the correct CUDA toolkit installation directory if not using the PyPI package.","cause":"The TileLang environment setup could not locate the CUDA installation, often due to an incompatible `nvidia-cuda-nvcc` PyPI package or incorrect environment variables.","error":"AssertionError: Expected cuda_home to be found, which may lead to compilation bugs when utilize tilelang backend."}]}