vLLM-Omni
vLLM-Omni is a framework for efficient inference with omni-modality models, built on top of vLLM. It supports speech, image, video, audio, and mixed multimodal generation, and its releases are aligned with upstream vLLM. The current version is 0.20.0, under active development with a monthly release cadence.
pip install vllm-omni

Common errors
error: ModuleNotFoundError: No module named 'vllm_omni'
cause: The PyPI package is named vllm-omni (with a hyphen), but it does not install an importable vllm_omni module; its classes are exposed through vllm.
fix: Install with pip install vllm-omni and import from vllm (e.g., from vllm import LLM).
error: ImportError: cannot import name 'LLM' from 'vllm_omni'
cause: vllm_omni does not export LLM; core classes are re-exported through vllm.
fix: Use from vllm import LLM (install vllm if needed); see the sketch below.
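Both errors above reduce to the same import pattern; a minimal sketch of the correct form:

# Install by the PyPI name, then import core classes from vllm:
#   pip install vllm-omni
from vllm import LLM, SamplingParams   # correct
# from vllm_omni import LLM            # wrong: raises ImportError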
error: ValueError: The model is not supported by vLLM-Omni
cause: The model type is not yet implemented, or the model ships custom code and requires trust_remote_code=True.
fix: Check the supported-model list in the docs; if the model is listed but ships remote code, pass trust_remote_code=True.
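A minimal sketch of the trust_remote_code fix (the model name here is illustrative):

from vllm import LLM

# trust_remote_code=True may be required for models that ship
# custom modeling code alongside their weights.
llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", trust_remote_code=True)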
Warnings
breaking: vLLM-Omni versions must exactly match the upstream vLLM version they are built against. Mismatched versions (e.g., vllm-omni 0.20.0 with vllm 0.19.0) cause import errors or silent failures.
fix: Install the exact matching vLLM version, typically vllm==<same_version>, e.g. pip install vllm==0.20.0 vllm-omni==0.20.0. A runtime check is sketched below.
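A minimal sketch of that runtime check, using the standard-library importlib.metadata to read installed package versions by their PyPI names:

from importlib.metadata import version

# Compare the installed vllm and vllm-omni versions; they must match exactly.
vllm_version = version("vllm")
omni_version = version("vllm-omni")
if vllm_version != omni_version:
    raise RuntimeError(
        f"Version mismatch: vllm=={vllm_version}, vllm-omni=={omni_version}; "
        "install matching releases, e.g. pip install vllm==0.20.0 vllm-omni==0.20.0"
    )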
gotcha: Do not import from vllm_omni directly. All core classes (LLM, SamplingParams, etc.) are re-exported from vllm, and importing them from vllm_omni raises ImportError.
fix: Import from vllm (e.g., from vllm import LLM) instead of from vllm_omni.
deprecated: The old entrypoint vllm.entrypoints.openai.api_server is deprecated. Use the vllm serve command line, or vllm.entrypoints.openai.run_batch for batch inference.
fix: Use the `vllm serve` CLI or the new async engine API; a client sketch follows.
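A minimal client sketch, assuming a server started with `vllm serve Qwen/Qwen2-VL-7B-Instruct` on the default port (8000). The server exposes an OpenAI-compatible API, so the stock openai client works against it:

from openai import OpenAI

# Point the OpenAI client at the local vLLM server; the API key is
# unused by default, but the client requires some value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{"role": "user", "content": "Describe vLLM in one sentence."}],
)
print(resp.choices[0].message.content)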
Imports
- LLM
  wrong: from vllm_omni import LLM
  correct: from vllm import LLM
- SamplingParams
  wrong: from vllm_omni import SamplingParams
  correct: from vllm import SamplingParams
- AsyncLLMEngine
  from vllm.engine.async_llm_engine import AsyncLLMEngine (a usage sketch follows)
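A minimal async sketch, assuming vLLM's AsyncLLMEngine API, where engine.generate is an async generator yielding cumulative RequestOutput objects for a request:

import asyncio
from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

async def main():
    # Build the engine from engine args; the model name is illustrative.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="Qwen/Qwen2-VL-7B-Instruct", trust_remote_code=True)
    )
    params = SamplingParams(temperature=0.7, max_tokens=64)
    final = None
    # Each yielded item is the cumulative output for this request so far.
    async for output in engine.generate("Describe this framework.", params, request_id="req-0"):
        final = output
    print(final.outputs[0].text)

asyncio.run(main())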
Quickstart
from vllm import LLM, SamplingParams

# Load a multimodal model (e.g., Qwen2-VL); trust_remote_code=True is
# required for models that ship custom code.
llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", trust_remote_code=True)

# Build an OpenAI-style chat message mixing an image and text.
messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        {"type": "text", "text": "Describe this image."},
    ],
}]

sampling_params = SamplingParams(temperature=0.7, max_tokens=512)

# Chat-style role/content messages go through llm.chat();
# llm.generate() expects plain prompts rather than message dicts.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)