vLLM TPU
vLLM TPU is a variant of vLLM that runs on Google Cloud TPUs (v5e/v5p). It provides a high-throughput and memory-efficient inference and serving engine for large language models, leveraging TPU-specific optimizations like Pallas kernels for attention and quantization. The current version is 0.19.0, following the main vLLM release cadence (monthly).
Install
pip install vllm-tpu
Common errors
error ModuleNotFoundError: No module named 'vllm'
cause The vllm-tpu package does install a top-level 'vllm' module, so this error means the installation is broken or the import is running in a different environment from the one where vllm-tpu was installed.
fix
Ensure vllm-tpu is installed in the current environment: pip list | grep vllm
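A quick diagnostic for this case, a minimal sketch using only the standard library, to confirm which interpreter is active and whether it can import vllm:

import importlib.util
import sys

# A mismatch here usually explains the ModuleNotFoundError: the script is
# running under a different interpreter than the one vllm-tpu was installed into.
print("interpreter:", sys.executable)

# find_spec returns None when the module is not importable from this environment.
spec = importlib.util.find_spec("vllm")
print("vllm:", spec.origin if spec else "NOT importable in this environment")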
error RuntimeError: TPU not found
cause Code running on a non-TPU machine (e.g., GPU or CPU).
fix
Run on a TPU VM. Setting VLLM_DEVICE='cpu' forces a CPU fallback, but that bypasses the TPU backend entirely.
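Before launching vLLM, you can confirm the host actually exposes a TPU. A minimal check, assuming torch_xla is installed (it is part of the TPU runtime on a TPU VM):

import torch_xla.core.xla_model as xm

# xla_device() raises if no XLA device (i.e., no TPU) is visible on this host.
device = xm.xla_device()
print("XLA device:", device)  # e.g. xla:0 on a v5e/v5p VM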
error ValueError: Unsupported model architecture: ...
cause Model not compatible with vLLM TPU (e.g., models requiring custom CUDA kernels).
fix
Use a model from the list of supported architectures (e.g., Llama, Mistral, Qwen2).
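To fail fast on an incompatible checkpoint, you can probe the model at startup. A sketch that assumes only what the error above shows, namely that LLM construction raises ValueError for unsupported architectures:

from vllm import LLM

try:
    llm = LLM(model="mistralai/Mistral-7B-v0.1")  # Mistral is a supported architecture
except ValueError as e:
    # Raised when the checkpoint's architecture has no TPU-compatible implementation.
    print(f"Not supported on vLLM TPU: {e}")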
Warnings
gotcha vLLM TPU is experimental and does not support every feature of mainline vLLM (e.g., tensor parallelism, quantization). Check the official docs for supported model architectures and features.
fix Verify model compatibility before use; refer to the vLLM TPU docs.
gotcha You must run on a TPU VM (v5e/v5p) with torch_xla installed. Installing vllm-tpu on a CPU- or GPU-only machine will fail.
fix Provision a TPU VM and install the TPU runtime: https://cloud.google.com/tpu/docs/users-guide-tpu-vm
breaking As of v0.19.0, vllm-tpu is a separate PyPI package from vllm. Mixing the two installations may cause conflicts.
fix Uninstall vllm first: pip uninstall vllm; then pip install vllm-tpu.
Imports
- LLM: from vllm import LLM
- SamplingParams: from vllm import SamplingParams
- AsyncLLMEngine: from vllm import AsyncLLMEngine
Quickstart
import os
os.environ['VLLM_TPU'] = '1'  # Optional: explicitly enable the TPU backend

from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# max_num_seqs caps how many sequences are batched together per step
llm = LLM(model="Qwen/Qwen2.5-1.5B", max_num_seqs=8)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
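For serving many concurrent requests, the AsyncLLMEngine listed under Imports streams outputs per request. A hedged sketch, assuming the interface matches mainline vLLM (from_engine_args plus an async generate() that yields partial RequestOutputs):

import asyncio

from vllm import AsyncLLMEngine, SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs

async def main():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="Qwen/Qwen2.5-1.5B", max_num_seqs=8)
    )
    params = SamplingParams(temperature=0.8, top_p=0.95)
    final = None
    # generate() yields a growing RequestOutput as tokens stream in.
    async for out in engine.generate("Hello, my name is", params, request_id="req-1"):
        final = out
    print(final.outputs[0].text)

asyncio.run(main())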