{"id":2028,"library":"flashinfer-python","title":"FlashInfer: Kernel Library for LLM Serving","description":"FlashInfer is a high-performance kernel library for optimizing Large Language Model (LLM) inference on NVIDIA GPUs. It provides efficient CUDA kernels for operations such as paged attention, prefill, and decode. Currently at version 0.6.7.post3, the library is under active development with frequent patch releases and nightly builds, indicating rapid evolution and potential API changes.","status":"active","version":"0.6.7.post3","language":"en","source_language":"en","source_url":"https://github.com/flashinfer-ai/flashinfer","tags":["LLM","inference","CUDA","GPU","attention","AI","performance"],
"install":[{"cmd":"pip install flashinfer-python","lang":"bash","label":"Stable Release"},{"cmd":"pip install flashinfer-python --pre --extra-index-url https://flashinfer.ai/whl/cu121","lang":"bash","label":"Nightly Build (example for CUDA 12.1)"}],
"dependencies":[{"reason":"FlashInfer kernels operate on PyTorch tensors; torch is a direct dependency for tensor allocation and device management.","package":"torch"}],
"imports":[{"symbol":"flashinfer","correct":"import flashinfer"},{"symbol":"BatchDecodeWithPagedKVCacheWrapper","correct":"from flashinfer import BatchDecodeWithPagedKVCacheWrapper"},{"symbol":"BatchPrefillWithRaggedKVCacheWrapper","correct":"from flashinfer import BatchPrefillWithRaggedKVCacheWrapper"},{"symbol":"single_decode_with_kv_cache","correct":"from flashinfer import single_decode_with_kv_cache"}],
"quickstart":{"code":"import torch\nimport flashinfer\n\n# Ensure CUDA is available\nif not torch.cuda.is_available():\n    raise RuntimeError(\"CUDA is not available. FlashInfer requires a CUDA-enabled GPU.\")\n\n# Device and dtype\ndevice = \"cuda:0\"\ndtype = torch.float16\n\n# Model parameters (simplified single-layer example)\nnum_qo_heads = 32\nnum_kv_heads = 32\nhead_dim = 128\npage_size = 16\nmax_num_pages = 128\n\n# 1. Allocate the paged KV cache.\n# Layout \"NHD\": (max_num_pages, 2, page_size, num_kv_heads, head_dim);\n# dimension 1 holds K and V. In real serving these pages would be\n# populated by a prefill pass; here we fill them with random values.\nkv_cache = torch.randn(\n    max_num_pages, 2, page_size, num_kv_heads, head_dim,\n    dtype=dtype, device=device,\n)\n\n# 2. Create the decode wrapper with a workspace buffer for the kernels.\nworkspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device=device)\ndecode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace_buffer, \"NHD\")\n\n# 3. Describe one sequence of 50 cached tokens via page-table metadata.\nbatch_size = 1\nseq_len = 50\nnum_pages_used = (seq_len + page_size - 1) // page_size  # 4 pages\nkv_page_indices = torch.arange(num_pages_used, dtype=torch.int32, device=device)\nkv_page_indptr = torch.tensor([0, num_pages_used], dtype=torch.int32, device=device)\nkv_last_page_len = torch.tensor(\n    [seq_len - (num_pages_used - 1) * page_size],  # 2 tokens on the last page\n    dtype=torch.int32, device=device,\n)\n\n# 4. Plan the decode: builds internal scheduling metadata for the kernels.\ndecode_wrapper.plan(\n    kv_page_indptr,\n    kv_page_indices,\n    kv_last_page_len,\n    num_qo_heads,\n    num_kv_heads,\n    head_dim,\n    page_size,\n    pos_encoding_mode=\"NONE\",\n    data_type=dtype,\n)\n\n# 5. Run single-token decode; query shape is (batch_size, num_qo_heads, head_dim).\nq = torch.randn(batch_size, num_qo_heads, head_dim, dtype=dtype, device=device)\noutput = decode_wrapper.run(q, kv_cache)\n\nprint(f\"FlashInfer BatchDecode output shape: {output.shape}\")  # (1, 32, 128)\nprint(\"FlashInfer decode successful.\")\n","lang":"python","description":"This quickstart allocates a paged KV cache tensor in the \"NHD\" layout, builds the page-table metadata (`kv_page_indptr`, `kv_page_indices`, `kv_last_page_len`) for one sequence of 50 cached tokens, then plans and runs `BatchDecodeWithPagedKVCacheWrapper` to compute attention for a single new query token: the typical decode step in LLM serving."},
"warnings":[{"fix":"Ensure your system has a compatible NVIDIA GPU. When installing, verify that the `flashinfer-python` wheel's CUDA version tag (+cuXXX) matches your installed CUDA toolkit, or build from source if no matching wheel is available.","message":"FlashInfer is a kernel library requiring an NVIDIA GPU with a compatible CUDA runtime. It will not work on CPUs or other accelerators. Using pre-built wheels requires a matching CUDA toolkit version (e.g., cu118 for CUDA 11.8); a mismatch often leads to `RuntimeError` or `ModuleNotFoundError`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Pin `flashinfer-python` to a specific version (`flashinfer-python==x.y.z`) to prevent unexpected breakage. Always refer to the official GitHub repository's README and examples for the latest API usage patterns when upgrading.","message":"The library is under active development and not yet at a 1.0 release.
Frequent minor and patch releases (including nightly builds) may introduce API changes or breaking modifications to function signatures and class constructors.","severity":"breaking","affected_versions":"All versions prior to 1.0"},{"fix":"Thoroughly review the official documentation and examples for the paged KV cache layout and high-level wrappers such as `BatchDecodeWithPagedKVCacheWrapper` to ensure correct page-table metadata, planning, and tensor layout.","message":"FlashInfer's API, particularly the paged KV cache metadata and attention wrappers, is relatively low-level. Incorrect page-table metadata (`kv_page_indptr`, `kv_page_indices`, `kv_last_page_len`), workspace buffers, or tensor layouts can lead to subtle bugs, incorrect attention outputs, or illegal memory accesses.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Check FlashInfer's documentation or GitHub issues for recommended PyTorch versions. If you encounter issues, align PyTorch with the CUDA version targeted by your FlashInfer installation (e.g., `pip install torch==2.x.x+cuXXX`).","message":"FlashInfer is tightly coupled with PyTorch for tensor operations and device management. While `torch` is a dependency, ensure your PyTorch version is compatible, especially when using specific CUDA versions or pre-built FlashInfer wheels.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}