{"id":1778,"library":"vllm","title":"vLLM","description":"vLLM is a high-throughput and memory-efficient inference and serving engine for large language models (LLMs). It utilizes various optimization techniques, such as PagedAttention, to significantly improve LLM serving performance. Currently at version 0.19.0, vLLM maintains a rapid release cadence with frequent updates and new feature additions.","status":"active","version":"0.19.0","language":"en","source_language":"en","source_url":"https://github.com/vllm-project/vllm","tags":["LLM","inference","GPU","serving","high-performance","deep-learning"],"install":[{"cmd":"pip install vllm","lang":"bash","label":"Standard installation"},{"cmd":"pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121","lang":"bash","label":"For CUDA 12.1"},{"cmd":"uv pip install vllm --torch-backend=auto","lang":"bash","label":"With uv and auto PyTorch backend"}],"dependencies":[{"reason":"Required for specific model architectures like Gemma 4, typically >=5.5.0.","package":"transformers","optional":true},{"reason":"Core deep learning framework. Specific CUDA versions (e.g., cu121, cu129) are critical for GPU compatibility.","package":"torch","optional":false}],"imports":[{"symbol":"LLM","correct":"from vllm import LLM"},{"symbol":"SamplingParams","correct":"from vllm import SamplingParams"}],"quickstart":{"code":"import os\nfrom vllm import LLM, SamplingParams\n\n# For demonstration, use a small model. 
Replace with your desired model, e.g., 'mistralai/Mistral-7B-Instruct-v0.2'\n# If the model is not found locally, vLLM will attempt to download it from Hugging Face.\n# Ensure you have sufficient GPU memory for the chosen model.\nmodel_name = os.environ.get(\"VLLM_MODEL\", \"facebook/opt-125m\")\n\n# Initialize the LLM engine\nllm = LLM(model=model_name)\n\n# Prepare prompts and sampling parameters\nprompts = [\n    \"Hello, my name is\",\n    \"The capital of France is\",\n    \"Write a short poem about a cat.\"\n]\nsampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)\n\n# Generate text\noutputs = llm.generate(prompts, sampling_params)\n\n# Print the results\nfor output in outputs:\n    prompt = output.prompt\n    generated_text = output.outputs[0].text\n    print(f\"Prompt: {prompt!r}, Generated text: {generated_text!r}\")","lang":"python","description":"This quickstart initializes the vLLM engine with a specified model and generates text for multiple prompts using custom sampling parameters. Ensure you have a CUDA-enabled GPU and adequate VRAM for the chosen model. The model will be downloaded from Hugging Face if not available locally."},"warnings":[{"fix":"To disable async scheduling and revert to synchronous behavior, pass the `--no-async-scheduling` flag when launching the vLLM server, or pass `async_scheduling=False` when constructing `LLM` in Python.","message":"Starting from v0.14.0, asynchronous scheduling is enabled by default. This changes the execution flow and might affect existing scripts. Some configurations, such as pipeline parallelism, the CPU backend, and certain speculative decoding methods, are not yet supported with async scheduling.","severity":"breaking","affected_versions":">=0.14.0"},{"fix":"Ensure your PyTorch installation matches the required version and was built for a compatible CUDA version. 
Consider using `pip install vllm --extra-index-url https://download.pytorch.org/whl/cuXX`, where `cuXX` is the wheel tag for your CUDA version (e.g., `cu121`, `cu129`).","message":"vLLM v0.14.0 introduced a hard requirement for PyTorch 2.9.1, and the default wheels are compiled against CUDA 12.9. Using older PyTorch versions or incompatible CUDA versions can lead to installation failures or runtime errors.","severity":"breaking","affected_versions":">=0.14.0"},{"fix":"Try removing system CUDA paths from `LD_LIBRARY_PATH` (e.g., `unset LD_LIBRARY_PATH`), installing vLLM with `uv pip install vllm --torch-backend=auto`, or specifying the PyTorch CUDA wheel index during installation (e.g., `pip install vllm --extra-index-url https://download.pytorch.org/whl/cu129`).","message":"Users on CUDA 12.9+ may encounter `CUBLAS_STATUS_INVALID_VALUE` errors. This is often caused by a mismatch between the system CUDA libraries and those bundled with the installed PyTorch.","severity":"gotcha","affected_versions":">=0.17.0"},{"fix":"Upgrade your `transformers` library to the recommended version. For Gemma 4, `transformers>=5.5.0` is required. Always check the release notes for specific model requirements.","message":"Support for new model architectures, like Gemma 4 in v0.19.0, often requires a minimum `transformers` library version. Missing this requirement can lead to model loading failures.","severity":"gotcha","affected_versions":">=0.19.0"},{"fix":"If you encounter accuracy issues with this specific model/hardware/quantization combination, consider using a different KV cache precision (e.g., FP16) or check later patch releases for potential fixes.","message":"Serving Qwen3.5 models with FP8 KV cache on B200 GPUs in v0.18.0 was reported to degrade accuracy.","severity":"gotcha","affected_versions":"0.18.0"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}