vLLM TPU

Version 0.19.0 · verified Sat May 09

vLLM TPU is a variant of vLLM that runs on Google Cloud TPUs (v5e/v5p). It provides high-throughput, memory-efficient inference and serving for large language models, leveraging TPU-specific optimizations such as Pallas kernels for attention and quantization. The current version is 0.19.0, following the main vLLM release cadence (monthly).

pip install vllm-tpu
error ModuleNotFoundError: No module named 'vllm'
cause vllm is not importable from the active environment: the installation is broken, or the code is running in a different environment from the one where vllm-tpu was installed.
fix
Ensure vllm-tpu is installed in the current environment: pip list | grep vllm
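
A quick way to confirm which interpreter is active and what is installed (a minimal sketch; probing both package names is an assumption, see the packaging note below):

import importlib.metadata as md
import sys

print(sys.executable)  # the interpreter actually in use
for dist in ("vllm", "vllm-tpu"):
    try:
        print(dist, md.version(dist))
    except md.PackageNotFoundError:
        print(dist, "not installed in this environment")
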
error RuntimeError: TPU not found
cause Code running on a non-TPU machine (e.g., GPU or CPU).
fix
Run on a TPU VM (v5e/v5p). A CPU fallback (e.g., VLLM_DEVICE='cpu') may work for testing, but it does not use the TPU.
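
To confirm that the TPU runtime is actually visible from Python, a minimal torch_xla check (a sketch, assuming torch_xla is installed on the TPU VM):

import torch_xla.core.xla_model as xm

device = xm.xla_device()  # resolves the default XLA device for this process
print(device)             # on a TPU VM this should print an 'xla:...' device
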
error ValueError: Unsupported model architecture: ...
cause Model not compatible with vLLM TPU (e.g., models requiring custom CUDA kernels).
fix
Use a model from the list of supported architectures: Llama, Mistral, Qwen2, etc.
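
One way to check a checkpoint's architecture string before handing it to vLLM TPU is to read its Hugging Face config (a sketch, assuming the transformers library is available; the model name is only an example):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-1.5B")
print(cfg.architectures)  # e.g. ['Qwen2ForCausalLM']; compare against the supported list in the docs
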
gotcha vLLM TPU is experimental and does not support all features of the main vLLM (e.g., tensor parallelism, quantization). Check the official docs for supported model architectures and features.
fix Verify model compatibility before use; refer to the vLLM TPU docs.
gotcha You must run on a TPU VM (v5e/v5p) with torch_xla installed. Installing vllm-tpu on CPU/GPU will fail.
fix Provision a TPU VM and install the TPU runtime: https://cloud.google.com/tpu/docs/users-guide-tpu-vm
breaking As of v0.19.0, the vllm-tpu package is a separate PyPI package from vllm. Mixing installations may cause conflicts.
fix Uninstall vllm first: pip uninstall vllm; then pip install vllm-tpu.

Basic inference with a small model on TPU. Assumes a TPU VM (v5e/v5p) with torch_xla installed.

import os
os.environ['VLLM_TPU'] = '1'  # Optional: explicitly enable TPU backend
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="Qwen/Qwen2.5-1.5B", max_num_seqs=8)  # max_num_seqs caps concurrent sequences per batch
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
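
The same model can also be served over vLLM's OpenAI-compatible API and queried with the openai client. This is a sketch under the assumption that the API server behaves the same on the TPU build as on mainline vLLM and that the openai package is installed.

# Start the server separately on the TPU VM, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-1.5B --max-num-seqs 8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server; no real key needed
resp = client.completions.create(
    model="Qwen/Qwen2.5-1.5B",
    prompt="The capital of France is",
    max_tokens=32,
    temperature=0.8,
)
print(resp.choices[0].text)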