vLLM

0.19.0 · active · verified Thu Apr 09

vLLM is a high-throughput, memory-efficient inference and serving engine for large language models (LLMs). It uses optimization techniques such as PagedAttention, which manages the KV cache in fixed-size blocks to reduce memory fragmentation, to significantly improve LLM serving performance. Currently at version 0.19.0, vLLM maintains a rapid release cadence with frequent updates and new features.
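The PagedAttention idea mentioned above can be sketched in a few lines: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical blocks to physical ones, so memory is claimed on demand rather than pre-reserved for the maximum sequence length. This is an illustrative toy, assuming a block size of 16 tokens (vLLM's historical default); the names `blocks_needed` and `BlockAllocator` are invented here and are not vLLM's actual internals.

```python
from math import ceil

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative; vLLM's default)


def blocks_needed(num_tokens: int) -> int:
    """Physical KV-cache blocks required to hold num_tokens of KV state."""
    return ceil(num_tokens / BLOCK_SIZE)


class BlockAllocator:
    """Toy allocator: hands out physical block ids from a free list."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self, num_tokens: int) -> list[int]:
        # The returned list plays the role of a sequence's block table:
        # index = logical block, value = physical block id.
        return [self.free.pop() for _ in range(blocks_needed(num_tokens))]


allocator = BlockAllocator(num_blocks=8)
table_a = allocator.allocate(40)  # 40 tokens -> 3 blocks, not a full-length reservation
table_b = allocator.allocate(10)  # 10 tokens -> 1 block
print(blocks_needed(40), len(table_a), len(table_b))  # → 3 3 1
```

Because unused blocks stay on the free list, many sequences can share one GPU's cache, which is where the throughput gain comes from.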

Warnings

The default build targets CUDA-capable NVIDIA GPUs. Make sure the GPU has enough free VRAM for the model you load; initializing the engine with a model larger than available memory will fail.

Install

pip install vllm

Imports

from vllm import LLM, SamplingParams

Quickstart

This quickstart demonstrates how to initialize the vLLM engine with a specified model and generate text for multiple prompts using custom sampling parameters. Ensure you have a CUDA-enabled GPU and adequate VRAM for the chosen model. The model will be downloaded from Hugging Face if not available locally.

import os
from vllm import LLM, SamplingParams

# For demonstration, use a small model. Replace with your desired model, e.g., 'mistralai/Mistral-7B-Instruct-v0.2'
# If the model is not found locally, vLLM will attempt to download it from Hugging Face.
# Ensure you have sufficient GPU memory for the chosen model.
model_name = os.environ.get("VLLM_MODEL", "facebook/opt-125m")

# Initialize the LLM engine
llm = LLM(model=model_name)

# Prepare prompts and sampling parameters
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "Write a short poem about a cat."
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=100)

# Generate text
outputs = llm.generate(prompts, sampling_params)

# Print the results
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
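Beyond offline `llm.generate()` calls, vLLM also ships an OpenAI-compatible HTTP server (started with `vllm serve <model>`). The sketch below builds the JSON body a client would POST to its `/v1/completions` endpoint, reusing the sampling settings from the quickstart; the field names follow the OpenAI completions API, and the model name is the same demo model used above.

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/completions endpoint.
# Field names follow the OpenAI completions API; values mirror the
# SamplingParams used in the quickstart above.
payload = {
    "model": "facebook/opt-125m",
    "prompt": "The capital of France is",
    "temperature": 0.7,
    "top_p": 0.95,
    "max_tokens": 100,
}

body = json.dumps(payload)
print(body)
```

With a server running locally, this body would be POSTed to `http://localhost:8000/v1/completions` with a `Content-Type: application/json` header; any OpenAI-compatible client library can be pointed at that base URL instead.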
