vLLM TPU

Version 0.19.0 · verified Sat May 09

vLLM TPU is a variant of vLLM that runs on Google Cloud TPUs (v5e/v5p). It provides high-throughput, memory-efficient inference and serving for large language models, leveraging TPU-specific optimizations such as Pallas kernels for attention and quantization. The current version is 0.19.0, following the main vLLM release cadence (monthly).

pip install vllm-tpu
error ModuleNotFoundError: No module named 'vllm'
cause vllm is not importable from the active environment: the installation is broken, or the code is running in a different environment from the one where vllm-tpu was installed.
fix
Ensure vllm-tpu is installed in the current environment: pip list | grep vllm
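
A quick way to confirm which interpreter is active and what is installed (a minimal sketch; probing both package names is an assumption, see the packaging note below):

import importlib.metadata as md
import sys

print(sys.executable)  # the interpreter actually in use
for dist in ("vllm", "vllm-tpu"):
    try:
        print(dist, md.version(dist))
    except md.PackageNotFoundError:
        print(dist, "not installed in this environment")
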
error RuntimeError: TPU not found
cause Code running on a non-TPU machine (e.g., GPU or CPU).
fix
Run on a TPU VM (v5e/v5p). A CPU fallback (e.g., VLLM_DEVICE='cpu') may work for testing, but it does not use the TPU.
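
To confirm that the TPU runtime is actually visible from Python, a minimal torch_xla check (a sketch, assuming torch_xla is installed on the TPU VM):

import torch_xla.core.xla_model as xm

device = xm.xla_device()  # resolves the default XLA device for this process
print(device)             # on a TPU VM this should print an 'xla:...' device
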
error ValueError: Unsupported model architecture: ...
cause Model not compatible with vLLM TPU (e.g., models requiring custom CUDA kernels).
fix
Use a model from the list of supported architectures: Llama, Mistral, Qwen2, etc.
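
One way to check a checkpoint's architecture string before handing it to vLLM TPU is to read its Hugging Face config (a sketch, assuming the transformers library is available; the model name is only an example):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-1.5B")
print(cfg.architectures)  # e.g. ['Qwen2ForCausalLM']; compare against the supported list in the docs
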
gotcha vLLM TPU is experimental and does not support all features of the main vLLM (e.g., tensor parallelism, quantization). Check the official docs for supported model architectures and features.
fix Verify model compatibility before use; refer to the vLLM TPU docs.
gotcha You must run on a TPU VM (v5e/v5p) with torch_xla installed. Installing vllm-tpu on CPU/GPU will fail.
fix Provision a TPU VM and install the TPU runtime: https://cloud.google.com/tpu/docs/users-guide-tpu-vm
breaking As of v0.19.0, the vllm-tpu package is a separate PyPI package from vllm. Mixing installations may cause conflicts.
fix Uninstall vllm first: pip uninstall vllm; then pip install vllm-tpu.

Basic inference with a small model on TPU. Assumes a TPU VM (v5e/v5p) with torch_xla installed.

import os
os.environ['VLLM_TPU'] = '1'  # Optional: explicitly enable TPU backend
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="Qwen/Qwen2.5-1.5B", max_num_seqs=8)  # max_num_seqs caps concurrent sequences per batch
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
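
The same model can also be served over vLLM's OpenAI-compatible API and queried with the openai client. This is a sketch under the assumption that the API server behaves the same on the TPU build as on mainline vLLM and that the openai package is installed.

# Start the server separately on the TPU VM, e.g.:
#   python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-1.5B --max-num-seqs 8
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # local server; no real key needed
resp = client.completions.create(
    model="Qwen/Qwen2.5-1.5B",
    prompt="The capital of France is",
    max_tokens=32,
    temperature=0.8,
)
print(resp.choices[0].text)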