SGLang
SGLang is a high-performance serving framework for large language models (LLMs) and vision-language models (VLMs), implemented as a domain-specific language embedded in Python. It optimizes LLM inference through techniques such as RadixAttention for KV cache reuse, continuous batching, speculative decoding, and various parallelization strategies. The library supports a broad range of models from Hugging Face and offers an OpenAI-compatible API. SGLang maintains an active release cadence, typically shipping new versions every one to two months, and is currently at version 0.5.9.
Warnings
- breaking SGLang Model Gateway v0.3.0 introduced a complete overhaul of its metrics architecture. Users of the SGLang Gateway must update Prometheus dashboards and alerting rules as metric names and structures have changed significantly.
- gotcha As of v0.5.10rc0, Piecewise CUDA graph capture is enabled by default. While this generally improves throughput and reduces memory overhead, users with specific performance tuning or those expecting non-graph execution behavior may need to re-evaluate their configurations.
- breaking SGLang v0.5.10rc0 includes a major upgrade of the `transformers` library from version 4.57.1 to 5.3.0. This significant version jump could lead to compatibility issues with custom models, tokenizers, or code relying on older `transformers` APIs.
- gotcha Users observed a significant performance regression (increased TTFT, reduced cache-hit rate) between SGLang 0.5.5 and 0.5.6+ when paired with Mooncake 0.3.7 under sustained benchmark workloads, even after a fix for scheduler memory growth.
- gotcha SGLang heavily relies on NVIDIA CUDA and generally requires a Linux environment for full functionality, particularly for its highly optimized kernels. While some components might run on WSL2 for Windows, macOS is typically not supported due to underlying CUDA dependencies.
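Given the transformers 4.57.1 → 5.3.0 jump noted above, it can be worth failing fast at startup if the installed version is beyond what your custom model or tokenizer code was written against. A minimal sketch; the helper names here are hypothetical, not part of SGLang:

```python
from importlib import metadata


def parse_version(ver: str) -> tuple:
    """Parse a 'major.minor.patch' string into a comparable tuple of ints."""
    parts = []
    for piece in ver.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def require_transformers_below(bound: str = "5.0.0") -> None:
    """Raise if the installed transformers is at or above a breaking major bound."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return  # transformers not installed; nothing to check
    if parse_version(installed) >= parse_version(bound):
        raise RuntimeError(
            f"transformers {installed} >= {bound}; "
            "code written against the 4.x API may break."
        )
```

Calling `require_transformers_below()` before importing custom model code turns a silent API mismatch into an explicit error.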
Install
-
pip install sglang
Imports
- sglang
import sglang as sgl
- OpenAI client
from openai import OpenAI
Quickstart
import os
from openai import OpenAI
import time
# --- Step 1: Launch SGLang Server (Run this in a separate terminal) ---
# Command: python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
# Note: Replace 'meta-llama/Llama-3.1-8B-Instruct' with a model you have access to
# and ensure you have logged into Hugging Face CLI if it's a gated model.
# Server output will indicate when it's ready, e.g., 'Uvicorn running on http://0.0.0.0:30000'
# --- Step 2: Interact with the SGLang server using OpenAI-compatible client ---
# Wait a moment for the server to start, or adjust the sleep duration
time.sleep(5)
client = OpenAI(
    base_url=os.environ.get('SGLANG_SERVER_URL', 'http://localhost:30000/v1'),
    api_key=os.environ.get('SGLANG_API_KEY', 'EMPTY')  # 'EMPTY' is common for local SGLang instances
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # Model name must match the server's loaded model
        messages=[
            {"role": "user", "content": "What is the capital of France?"}
        ],
        max_tokens=50,
        stream=False,
    )
    print("Response from SGLang server:", response.choices[0].message.content)
except Exception as e:
    print(f"Error connecting to SGLang server or making request: {e}")
    print("Please ensure the SGLang server is running in a separate terminal.")
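Instead of a fixed `time.sleep`, the client can poll the server's health route before issuing requests. A sketch, assuming the server exposes a `GET /health` endpoint (verify the exact route against your SGLang version):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 120.0, interval_s: float = 2.0) -> bool:
    """Poll base_url/health until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet; retry after a short pause
        time.sleep(interval_s)
    return False
```

Call `wait_for_server("http://localhost:30000")` before constructing the OpenAI client; note the health route sits at the server root, not under `/v1`.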