SGLang Kernel Library
sgl-kernel is the core kernel library for SGLang, providing high-performance GPU-accelerated operations for LLM inference, including optimized attention, MoE routing, and CUDA graph execution. It is primarily used as a dependency of the main `sglang` library, which is currently at version `0.5.10.post1` and sees frequent updates.
Common errors
- ModuleNotFoundError: No module named 'sglang'
  Cause: the `sglang` library, which bundles `sgl-kernel`, is not installed.
  Fix: run `pip install sglang` to install the main library.
- RuntimeError: No CUDA device available.
  Cause: the Python environment cannot detect an NVIDIA GPU, or the CUDA drivers are not properly installed/configured.
  Fix: verify that NVIDIA drivers are installed and up to date; check `nvidia-smi` and `torch.cuda.is_available()` in Python; ensure the CUDA toolkit is compatible with your PyTorch installation.
- torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate...
  Cause: the model being loaded, or the batch size/sequence length, exceeds the available GPU memory.
  Fix: try a smaller model, reduce the batch size, or decrease `max_tokens`. Use quantization (e.g., 4-bit or 8-bit) if supported by the model and SGLang, or consider a GPU with more VRAM.
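To reason about OOM errors before they happen, a back-of-envelope KV-cache estimate helps. The sketch below uses the standard formula (2 tensors per layer, one each for keys and values); the model dimensions are illustrative assumptions in the style of an 8B Llama model, not values taken from any specific SGLang config.

```python
# Rough KV-cache size estimate, useful for reasoning about CUDA OOM errors.
# Model dimensions below are illustrative assumptions (Llama-3-8B-style).
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    # 2x for keys and values; fp16/bf16 uses 2 bytes per element.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=8192, batch_size=4)
print(f"Estimated KV cache: {size / 2**30:.2f} GiB")  # → 4.00 GiB
```

If the estimate plus model weights approaches your VRAM, lowering `seq_len` or `batch_size` is the cheapest lever.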
Warnings
- gotcha The `sgl-kernel` PyPI package (currently 0.3.21) is often an older version than the `sgl_kernel` sub-package distributed with the main `sglang` library (currently 0.5.10.post1). For the latest features, optimizations, and compatibility, it is strongly recommended to install `sglang`.
- gotcha sgl-kernel heavily relies on NVIDIA GPUs and CUDA. Running without a compatible GPU, sufficient VRAM, and correctly installed NVIDIA drivers (including CUDA toolkit) will lead to runtime errors or prevent `sglang` from functioning.
- breaking Starting with SGLang v0.5.10, piecewise CUDA graph capture is enabled by default. While generally improving throughput and reducing memory overhead, this might subtly change performance characteristics or expose new corner cases for models with highly complex control flow. Test your applications thoroughly.
- gotcha Installation of `flashinfer` (a key dependency for `sgl-kernel`'s optimized attention) can sometimes fail due to specific CUDA version requirements or compilation issues, especially when `torch` and `flashinfer` versions are mismatched or system CUDA is not configured correctly.
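If you suspect CUDA graph capture is behind a regression or a crash, disabling CUDA graphs entirely is a quick way to isolate it. The sketch below uses `--disable-cuda-graph`, a standard `sglang` server argument; the model path is a placeholder, and a piecewise-specific toggle (if any) may vary by version.

```shell
# Launch the SGLang server with CUDA graphs disabled to rule out
# graph-capture issues (trades some throughput for simpler execution).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --disable-cuda-graph
```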
Install
- pip install sglang
- pip install sgl-kernel
Imports
- get_memory_info
from sgl_kernel.cuda_helper import get_memory_info
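A minimal, hedged sketch of using this helper: it assumes `get_memory_info` requires a CUDA-capable environment and returns a `(free, total)` pair of byte counts — that return shape is an assumption, so the call is wrapped defensively.

```python
def gpu_memory_status():
    try:
        # Assumption: helper exists and needs a CUDA GPU at call time.
        from sgl_kernel.cuda_helper import get_memory_info
        free, total = get_memory_info()  # assumed (free, total) byte counts
        return f"GPU memory: {free} bytes free / {total} bytes total"
    except (ImportError, RuntimeError):
        return "sgl_kernel (or a CUDA GPU) is not available in this environment"

print(gpu_memory_status())
```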
Quickstart
import sglang as sgl

@sgl.function
def generate_joke(s, topic):
    s += f"Give me a joke about {topic}."
    s += sgl.gen("joke", max_tokens=64, temperature=0.7)

# Launch a local SGLang runtime (which uses sgl-kernel for execution).
# Use a local model path or a Hugging Face model ID you have downloaded.
runtime = sgl.Runtime(model_path="meta-llama/Llama-3.1-8B-Instruct")
sgl.set_default_backend(runtime)

# Run the function
state = generate_joke.run(topic="cats")
print(f"Joke about cats: {state['joke']}")

runtime.shutdown()