LMDeploy

0.12.3 · verified Fri May 01

LMDeploy is a toolkit for compressing, deploying, and serving large language models (LLMs). It supports efficient inference with quantization, continuous batching, and multiple backends (e.g., TurboMind, PyTorch). The current version is 0.12.3, with frequent releases tracking new model support and upstream dependencies.

pip install lmdeploy
error ModuleNotFoundError: No module named 'lmdeploy.turbomind'
cause In recent versions, `turbomind` is not a separately importable module; its classes have been moved to the top-level `lmdeploy` namespace.
fix
Use `from lmdeploy import TurbomindEngineConfig` instead.
error ImportError: cannot import name 'pipeline' from 'lmdeploy.serve'
cause The `pipeline` function is not in `lmdeploy.serve`; it is in the top-level `lmdeploy` module.
fix
Use `from lmdeploy import pipeline`.
error ValueError: Unsupported model format 'xxxx'
cause The `model_format` argument of `TurbomindEngineConfig` expects one of the supported formats (e.g., 'hf', 'awq', 'w4a16', 'w8a8'); an unrecognized string raises this error.
fix
Check the model format and use a valid one. For Hugging Face models, use `model_format='hf'`.
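As a quick pre-flight guard, the accepted strings listed above can be validated before building the engine config. Note that `SUPPORTED_FORMATS` and `check_model_format` below are hypothetical helpers for illustration, not part of the LMDeploy API:

```python
# Hypothetical pre-flight check (not an LMDeploy API): validate model_format
# against the formats listed above before constructing TurbomindEngineConfig.
SUPPORTED_FORMATS = {'hf', 'awq', 'w4a16', 'w8a8'}

def check_model_format(fmt: str) -> str:
    """Return fmt unchanged, or raise the same ValueError the engine would."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported model format '{fmt}'")
    return fmt
```

Failing fast here surfaces a typo at config time rather than deep inside engine initialization.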
breaking The `TurbomindEngineConfig` import path changed. In versions before 0.12.0, it was `from lmdeploy.turbomind import TurbomindEngineConfig`. Now it is `from lmdeploy import TurbomindEngineConfig`.
fix Update imports to `from lmdeploy import TurbomindEngineConfig`.
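For code that must run on both sides of the 0.12.0 boundary, a guarded import like the sketch below can absorb the path change (the old path is the one described above; the final `None` fallback is only there so the sketch stays importable when lmdeploy itself is absent):

```python
# Try the new top-level import first (>= 0.12.0 per this doc), then the old
# lmdeploy.turbomind path; fall back to None if lmdeploy is not installed.
try:
    from lmdeploy import TurbomindEngineConfig
except ImportError:
    try:
        from lmdeploy.turbomind import TurbomindEngineConfig
    except ImportError:
        TurbomindEngineConfig = None  # lmdeploy unavailable in this environment
```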
deprecated The `turbomind` backend is deprecated; use `TurbomindEngineConfig` with model_format='hf' or 'awq' instead of direct Turbomind engine creation.
fix Switch to using the pipeline with `TurbomindEngineConfig`.
gotcha When using `pipeline`, the model must be in Hugging Face format (HF) or quantized with LMDeploy's format. Passing a model name without the correct format may cause silent fallback or errors.
fix Explicitly set `model_format` in `TurbomindEngineConfig` (e.g., `model_format='hf'`) or pass the `--model-format` argument when using the CLI.

Initialize a pipeline with a Hugging Face model and engine config, then generate a response.

from lmdeploy import pipeline, TurbomindEngineConfig

# HF-format weights, tensor parallelism of 1 (single GPU).
engine_config = TurbomindEngineConfig(model_format='hf', tp=1)

# The pipeline takes the engine settings via `backend_config`.
pipe = pipeline('internlm/internlm2_5-1_8b', backend_config=engine_config)
response = pipe('Hello, how are you?')
print(response.text)