{"id":4540,"library":"flashinfer-cubin","title":"Pre-compiled cubins for FlashInfer","description":"FlashInfer-cubin provides pre-compiled kernel binaries for FlashInfer, supporting a wide range of GPU architectures. This optional package for `flashinfer-python` eliminates JIT compilation and downloading overhead at runtime, leading to faster initialization and enabling offline usage. The FlashInfer project focuses on delivering high-performance LLM GPU kernels for serving and inference, maintaining an active development cycle with frequent nightly builds and regular patch releases.","status":"active","version":"0.6.7.post3","language":"en","source_language":"en","source_url":"https://github.com/flashinfer-ai/flashinfer","tags":["cuda","llm","inference","gpu","optimization","kernels","pytorch"],"install":[{"cmd":"pip install flashinfer-python flashinfer-cubin","lang":"bash","label":"Recommended installation alongside core FlashInfer"},{"cmd":"pip install flashinfer-cubin","lang":"bash","label":"Direct installation of cubin package"}],"dependencies":[{"reason":"flashinfer-cubin provides pre-compiled kernels for the core FlashInfer library. It is not functional on its own.","package":"flashinfer-python"},{"reason":"FlashInfer is built on PyTorch and requires a compatible PyTorch installation with CUDA support.","package":"torch","optional":false}],"imports":[{"note":"The `flashinfer-cubin` package is not designed for direct import by end-users. It provides pre-compiled CUDA kernel files (`.cubin` files) that the `flashinfer-python` library loads and utilizes internally for optimized performance. 
Users interact with the `flashinfer` library directly.","symbol":"flashinfer-cubin","correct":"N/A"}],"quickstart":{"code":"import torch\nimport flashinfer\n\n# Single-request decode attention with FlashInfer.\n# (flashinfer-cubin provides the pre-compiled kernels, so this call\n# runs without JIT compilation or runtime cubin downloads.)\n\nkv_len = 2048\nnum_kv_heads = 32\nnum_qo_heads = 32\nhead_dim = 128\n\n# KV cache for a single request, \"NHD\" layout: [kv_len, num_kv_heads, head_dim]\nk = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device='cuda')\nv = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device='cuda')\n\n# One query vector per attention head: [num_qo_heads, head_dim]\nq = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device='cuda')\n\n# Decode attention against the KV cache\no = flashinfer.single_decode_with_kv_cache(q, k, v)\n\nprint(o.shape)  # torch.Size([32, 128])","lang":"python","description":"This quickstart demonstrates the core `flashinfer` library's single-request decode attention. When `flashinfer-cubin` is installed, it transparently supplies pre-compiled CUDA kernels to `flashinfer-python`, speeding up calls like this by avoiding runtime compilation and download overhead."},"warnings":[{"fix":"Always check the official FlashInfer documentation for supported PyTorch and CUDA versions. Use `flashinfer show-config` to verify your environment post-installation. 
It is recommended to install `flashinfer-python` and `flashinfer-cubin` together from PyPI to ensure compatible versions.","message":"FlashInfer, and by extension `flashinfer-cubin`, has strict compatibility requirements for CUDA and PyTorch versions. Incompatible versions can lead to runtime failures due to mismatches in precompiled kernels (e.g., CUDA 12 vs 13 toolkits) or Python library dependencies.","severity":"breaking","affected_versions":"<=0.6.x"},{"fix":"For air-gapped or restricted environments, consider using `flashinfer-jit-cache` (if available for your specific CUDA version) or pre-downloading kernels if `flashinfer-cubin` is insufficient. Monitor GitHub issues for updates on comprehensive cubin inclusion or use FlashInfer's source build for full control over compilation.","message":"`flashinfer-cubin` might not always contain all necessary pre-compiled cubins for every kernel or newer GPU architectures, especially for specific components like TRTLLM FMHA kernels. In such cases, `flashinfer-python` may attempt to download missing cubins at runtime, which can fail in isolated network environments or lead to unexpected JIT compilation.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If deploying in environments where `FLASHINFER_CUBIN_DIR` is critical, verify its behavior. It may be necessary to either prevent `flashinfer-cubin` from being installed by pip and manage cubins manually, or use `flashinfer-jit-cache` with a specified index URL for specific CUDA versions.","message":"The `FLASHINFER_CUBIN_DIR` environment variable, intended to specify a custom path for cubin files, may be ignored when `flashinfer-cubin` is installed via pip. This can lead to issues in containerized or non-root environments where explicit control over artifact paths is required.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Consult the FlashInfer documentation for a detailed breakdown of feature support per GPU architecture. 
Test critical workloads on your target hardware to ensure expected performance and functionality.","message":"While FlashInfer supports a wide range of NVIDIA GPU architectures (SM 7.5 'Turing' and later, up to SM 12.1 'Blackwell'), not all advanced features (e.g., FP8/FP4 operations, certain attention types) are supported across all compute capabilities. Performance can also vary significantly.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}