Pre-compiled cubins for FlashInfer
FlashInfer-cubin provides pre-compiled kernel binaries for FlashInfer, supporting a wide range of GPU architectures. This optional package for `flashinfer-python` eliminates JIT compilation and downloading overhead at runtime, leading to faster initialization and enabling offline usage. The FlashInfer project focuses on delivering high-performance LLM GPU kernels for serving and inference, maintaining an active development cycle with frequent nightly builds and regular patch releases.
Warnings
- breaking FlashInfer, and by extension `flashinfer-cubin`, has strict compatibility requirements for CUDA and PyTorch versions. Incompatible versions can lead to runtime failures due to mismatches in precompiled kernels (e.g., CUDA 12 vs 13 toolkits) or Python library dependencies.
- gotcha `flashinfer-cubin` might not always contain all necessary pre-compiled cubins for every kernel or newer GPU architectures, especially for specific components like TRTLLM FMHA kernels. In such cases, `flashinfer-python` may attempt to download missing cubins at runtime, which can fail in isolated network environments or lead to unexpected JIT compilation.
- gotcha The `FLASHINFER_CUBIN_DIR` environment variable, intended to specify a custom path for cubin files, may be ignored when `flashinfer-cubin` is installed via pip. This can lead to issues in containerized or non-root environments where explicit control over artifact paths is required.
- gotcha While FlashInfer supports a wide range of NVIDIA GPU architectures (SM 7.5 'Turing' and later, up to SM 12.1 'Blackwell'), not all advanced features (e.g., FP8/FP4 operations, certain attention types) are supported across all compute capabilities. Performance can also vary significantly.
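The `FLASHINFER_CUBIN_DIR` override mentioned above is set in the environment before the process starts; a minimal sketch (the directory path is a hypothetical example, and, per the warning, the variable may be ignored when `flashinfer-cubin` is installed via pip):

```shell
# Hypothetical path: a directory pre-populated with FlashInfer cubin artifacts,
# useful for air-gapped hosts where runtime downloads would fail.
export FLASHINFER_CUBIN_DIR=/opt/flashinfer/cubins
```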
Install
- pip install flashinfer-python flashinfer-cubin
- pip install flashinfer-cubin
Imports
- flashinfer-cubin
N/A (the package is consumed by `flashinfer-python` at runtime; it is not imported directly)
Quickstart
import torch
import flashinfer

# Single-request decode attention: flashinfer-cubin supplies the pre-compiled
# kernels, so no JIT compilation or cubin download happens on first call.
kv_len = 2048
num_qo_heads = 32
num_kv_heads = 32
head_dim = 128

# Query for the single new token, plus the cached K/V for prior tokens
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Decode attention over the whole KV cache (causal by construction: the
# single query attends to all kv_len cached positions)
output = flashinfer.single_decode_with_kv_cache(q, k, v)
print(output.shape)  # torch.Size([32, 128])
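Given the architecture window noted in the warnings (SM 7.5 through SM 12.1), a pre-flight check before calling into FlashInfer can be sketched as follows; `is_supported_sm` is a hypothetical helper, not part of the FlashInfer API:

```python
# Hypothetical helper: check whether a device's compute capability falls inside
# the support window stated above (SM 7.5 'Turing' through SM 12.1 'Blackwell').
def is_supported_sm(major: int, minor: int) -> bool:
    cc = major * 10 + minor
    return 75 <= cc <= 121

# On a machine with a CUDA build of PyTorch, the live device can be checked:
#   import torch
#   major, minor = torch.cuda.get_device_capability(0)
#   if not is_supported_sm(major, minor):
#       raise RuntimeError(f"SM {major}.{minor} is outside the supported range")
print(is_supported_sm(8, 0))  # Ampere (SM 8.0) -> True
```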