SGLang

0.5.9 · active · verified Wed Apr 01

SGLang is a high-performance serving framework for large language models (LLMs) and vision-language models (VLMs); its frontend is a domain-specific language embedded in Python. It speeds up LLM inference with techniques such as RadixAttention for KV-cache reuse, continuous batching, speculative decoding, and several parallelization strategies. The library supports a broad range of Hugging Face models and exposes an OpenAI-compatible API. Development is active, with releases roughly every one to two months; the current version is 0.5.9.

Warnings

Install
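SGLang is distributed on PyPI; the commonly documented install path uses pip with the `all` extra. Exact extras and CUDA requirements vary between releases, so treat this as a sketch and check the official docs for your version:

```shell
# Install SGLang with its full set of serving dependencies.
# The "[all]" extra is the commonly documented option; adjust per the release notes.
pip install "sglang[all]"
```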

Imports
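The quickstart below only needs the OpenAI Python client on the client side; the `sglang` package itself is required on the machine that runs the server. A small availability check (package names as published on PyPI; this probe is illustrative, not part of SGLang):

```python
import importlib.util

# Client side: the quickstart uses the OpenAI-compatible client ("openai").
# Server side: the "sglang" package provides the sglang.launch_server entry point.
# find_spec only probes whether a package is importable; it does not import it.
status = {
    pkg: importlib.util.find_spec(pkg) is not None
    for pkg in ("openai", "sglang")
}
for pkg, found in status.items():
    print(f"{pkg}: {'available' if found else 'not installed'}")
```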

Quickstart

This quickstart demonstrates how to interact with an SGLang server using the OpenAI Python client. First, launch the SGLang server in a separate terminal, specifying the model to serve. Then, use the provided Python script to connect to this server and send a chat completion request. Remember to replace the example model path with a valid one and handle Hugging Face authentication if using gated models.

import os
from openai import OpenAI

# --- Step 1: Launch the SGLang server (run this in a separate terminal) ---
# Command: python -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --port 30000
# Note: replace 'meta-llama/Llama-3.1-8B-Instruct' with a model you have access to,
# and log in with the Hugging Face CLI first if it is a gated model.
# The server prints a readiness message, e.g. 'Uvicorn running on http://0.0.0.0:30000',
# before it can accept requests.

# --- Step 2: Query the running server with the OpenAI-compatible client ---

client = OpenAI(
    base_url=os.environ.get('SGLANG_SERVER_URL', 'http://localhost:30000/v1'),
    api_key=os.environ.get('SGLANG_API_KEY', 'EMPTY') # 'EMPTY' is common for local SGLang instances
)

try:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct", # Model name must match server's loaded model
        messages=[
            {"role": "user", "content": "What is the capital of France?"}
        ],
        max_tokens=50,
        stream=False
    )
    print("Response from SGLang server:", response.choices[0].message.content)
except Exception as e:
    print(f"Error connecting to SGLang server or making request: {e}")
    print("Please ensure the SGLang server is running in a separate terminal.")
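The request above sets `stream=False`; the same `/v1/chat/completions` endpoint can also stream tokens back as Server-Sent Events when `stream` is true. A minimal stdlib-only sketch of consuming that stream (the model name and port assume the server launched in Step 1; this helper is illustrative, not part of SGLang):

```python
import json
import urllib.request

# Build an OpenAI-compatible streaming request. The model name must match
# whatever the server was launched with (example value shown here).
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50,
    "stream": True,  # ask the server to reply with Server-Sent Events
}

req = urllib.request.Request(
    "http://localhost:30000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        for raw in resp:  # each SSE line looks like b'data: {...}\n'
            line = raw.decode().strip()
            if not line.startswith("data: "):
                continue
            chunk = line[len("data: "):]
            if chunk == "[DONE]":  # sentinel marking the end of the stream
                break
            delta = json.loads(chunk)["choices"][0]["delta"]
            print(delta.get("content", ""), end="", flush=True)
except OSError as e:
    print(f"Could not reach the SGLang server: {e}")
```

Streaming is usually the better default for chat UIs, since tokens can be rendered as they arrive instead of after the full completion finishes.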
