{"id":14916,"library":"sgl-kernel","title":"SGLang Kernel Library","description":"sgl-kernel is the core kernel library for SGLang, providing high-performance GPU-accelerated operations for LLM inference, including optimized attention, MoE routing, and CUDA graph execution. It is primarily used as a dependency of the main `sglang` library, which is currently at version `0.5.10.post1` and sees frequent updates.","status":"active","version":"0.5.10.post1","language":"en","source_language":"en","source_url":"https://github.com/sgl-project/sglang/tree/main/sgl-kernel","tags":["llm","inference","gpu","cuda","attention","pytorch","moe"],"install":[{"cmd":"pip install sglang","lang":"bash","label":"Recommended for latest features (installs sgl-kernel as a dependency)"},{"cmd":"pip install sgl-kernel","lang":"bash","label":"Installs sgl-kernel 0.3.21 directly (may be an older version)"}],"dependencies":[{"reason":"Core deep learning framework dependency for GPU operations.","package":"torch"},{"reason":"Used for model loading and tokenizer functionalities.","package":"transformers"},{"reason":"Optimized attention kernel for enhanced performance.","package":"flashinfer","optional":false}],"imports":[{"note":"sgl-kernel primarily provides low-level, internal components for `sglang`. Direct imports by end-users are rare, but utilities like `get_memory_info` from `cuda_helper` might be used for debugging or advanced scenarios. 
Most users interact with `sglang` directly.","symbol":"get_memory_info","correct":"from sgl_kernel.cuda_helper import get_memory_info"}],"quickstart":{"code":"import sglang as sl\nimport os\n\nos.environ['SGLANG_DEV_MODE'] = 'True' # Optional: for development features\n\n# Launch a local SGLang runtime (which uses sgl-kernel for execution).\n# The model ID below is a placeholder: substitute any Hugging Face model ID or local model path.\nruntime = sl.Runtime(model_path=\"meta-llama/Llama-3.1-8B-Instruct\")\nsl.set_default_backend(runtime)\n\n@sl.function\ndef generate_joke(s, topic):\n    s += f\"Give me a joke about {topic}.\"\n    s += sl.gen(\"joke\", max_tokens=64, temperature=0.7)\n\n# Run the function against the default backend\nstate = generate_joke.run(topic=\"cats\")\n\nprint(f\"Joke about cats: {state['joke']}\")\n\nruntime.shutdown()","lang":"python","description":"This quickstart demonstrates how to use `sglang`, the main library that relies on `sgl-kernel` for high-performance execution. It launches a local runtime, registers it as the default backend, and runs a simple LLM generation task. Note that `sgl-kernel` itself does not expose a high-level API for direct user interaction; its functionality is accessed through `sglang`."},"warnings":[{"fix":"Install `sglang` via `pip install sglang` to get the up-to-date kernel components.","message":"The `sgl-kernel` PyPI package (currently 0.3.21) is often an older version than the `sgl_kernel` sub-package distributed with the main `sglang` library (currently 0.5.10.post1). For the latest features, optimizations, and compatibility, it is strongly recommended to install `sglang`.","severity":"gotcha","affected_versions":"<0.5.10.post1 (sgl-kernel PyPI)"},{"fix":"Ensure you have an NVIDIA GPU, up-to-date drivers, and CUDA toolkit installed. Check `torch.cuda.is_available()`.","message":"sgl-kernel relies heavily on NVIDIA GPUs and CUDA. 
Running without a compatible GPU, sufficient VRAM, and correctly installed NVIDIA drivers (including CUDA toolkit) will lead to runtime errors or prevent `sglang` from functioning.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Monitor memory usage and throughput after upgrading. If issues arise, consult SGLang documentation for potential configuration options to adjust CUDA graph behavior.","message":"Starting with SGLang v0.5.10, piecewise CUDA graph capture is enabled by default. While generally improving throughput and reducing memory overhead, this might subtly change performance characteristics or expose new corner cases for models with highly complex control flow. Test your applications thoroughly.","severity":"breaking","affected_versions":">=0.5.10 (of sglang)"},{"fix":"Ensure your `torch` version is compatible with `flashinfer`'s requirements. Check `flashinfer`'s GitHub for specific CUDA/Torch version matrices. Often, upgrading `pip` and reinstalling `flashinfer` from source or pre-built wheels helps.","message":"Installation of `flashinfer` (a key dependency for `sgl-kernel`'s optimized attention) can sometimes fail due to specific CUDA version requirements or compilation issues, especially when `torch` and `flashinfer` versions are mismatched or system CUDA is not configured correctly.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Run `pip install sglang` to install the main library.","cause":"The `sglang` library, which bundles `sgl-kernel`, is not installed.","error":"ModuleNotFoundError: No module named 'sglang'"},{"fix":"Verify NVIDIA drivers are installed and up to date. Check `nvidia-smi` and `torch.cuda.is_available()` in Python. 
Ensure CUDA toolkit is compatible with your PyTorch installation.","cause":"Python environment cannot detect an NVIDIA GPU or CUDA drivers are not properly installed/configured.","error":"RuntimeError: No CUDA device available."},{"fix":"Try using a smaller model, reducing batch size, or decreasing `max_tokens`. Utilize quantization (e.g., 4-bit, 8-bit) if supported by the model and SGLang. Consider using a GPU with more VRAM.","cause":"The model being loaded or the batch size/sequence length exceeds the available GPU memory.","error":"torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate..."}],"ecosystem":"pypi"}