vLLM
vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It uses optimization techniques such as PagedAttention to significantly improve LLM serving performance. Currently at version 0.19.0, vLLM maintains a rapid release cadence with frequent updates and new features.
Warnings
- breaking Starting from v0.14.0, asynchronous scheduling is enabled by default. This changes the execution flow and might affect existing scripts. Some configurations, such as pipeline parallelism, the CPU backend, and certain speculative decoding methods, are not yet supported with async scheduling.
- breaking vLLM v0.14.0 introduced a hard requirement for PyTorch 2.9.1 and the default wheels are compiled against CUDA 12.9. Using older PyTorch versions or incompatible CUDA versions can lead to installation failures or runtime errors.
- gotcha Users on CUDA 12.9+ may encounter `CUBLAS_STATUS_INVALID_VALUE` errors. This is often caused by a CUDA library mismatch with the installed PyTorch.
- gotcha Support for new model architectures, like Gemma 4 in v0.19.0, often requires a minimum `transformers` library version. Missing this requirement can lead to model loading failures.
- gotcha Serving Qwen3.5 models with FP8 KV cache on B200 GPUs in v0.18.0 was noted to have degraded accuracy.
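Given the hard PyTorch requirement above, a quick pre-flight check can catch version mismatches before vLLM fails at import or runtime. This is a minimal sketch using only the standard library; the required version string (2.9.1) is taken from the warning above and should be adjusted for the vLLM release you install.

```python
import importlib.metadata


def torch_version_matches(required: str = "2.9.1") -> bool:
    """Check whether the installed torch wheel matches the version vLLM expects.

    Returns False when torch is not installed at all.
    """
    try:
        installed = importlib.metadata.version("torch")
    except importlib.metadata.PackageNotFoundError:
        return False
    # Compare only the release segment so local build tags (e.g. "+cu129") still match.
    return installed.split("+")[0] == required


if __name__ == "__main__":
    print(torch_version_matches())
```

Running this before upgrading vLLM makes CUDA/PyTorch mismatch errors (like the `CUBLAS_STATUS_INVALID_VALUE` gotcha above) easier to diagnose.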
Install
- pip install vllm
- pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
- uv pip install vllm --torch-backend=auto
Imports
- LLM
from vllm import LLM
- SamplingParams
from vllm import SamplingParams
Quickstart
import os
from vllm import LLM, SamplingParams
# For demonstration, use a small model. Replace with your desired model, e.g., 'mistralai/Mistral-7B-Instruct-v0.2'
# If the model is not found locally, vLLM will attempt to download it from Hugging Face.
# Ensure you have sufficient GPU memory for the chosen model.
model_name = os.environ.get("VLLM_MODEL", "facebook/opt-125m")
# Initialize the LLM engine
llm = LLM(model=model_name)
# Prepare prompts and sampling parameters
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "Write a short poem about a cat.",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)
# Generate text
outputs = llm.generate(prompts, sampling_params)
# Print the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
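Beyond offline generation, the same model can be served over vLLM's OpenAI-compatible HTTP API. A minimal sketch (the model name and port here are illustrative; any model usable with `LLM(...)` above should work):

```shell
# Start an OpenAI-compatible server (downloads the model if not cached locally).
vllm serve facebook/opt-125m --port 8000

# In another shell, query the completions endpoint:
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "facebook/opt-125m", "prompt": "Hello, my name is", "max_tokens": 32}'
```

Because the API mirrors OpenAI's, existing OpenAI client libraries can be pointed at the local server by changing the base URL.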