Xinference Client
Xinference-client is the official Python client library for interacting with a Xinference server. It lets you manage and deploy AI models (LLMs, embedding models, speech models, and more) and run inference against them over HTTP. Releases track the broader Xinference ecosystem, so keep the client and server versions roughly in sync.
Common errors
- ModuleNotFoundError: No module named 'xinference_client'
  cause: Incorrect import statement. The `xinference-client` PyPI package installs its modules under the `xinference` namespace, not `xinference_client`.
  fix: Change the import from `from xinference_client import Client` to `from xinference.client import Client`.
- requests.exceptions.ConnectionError: HTTPConnectionPool(host='127.0.0.1', port=9997): Max retries exceeded with url: /v1/models (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at ...>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it'))
  cause: The Xinference server is not running or is not reachable at the specified URL and port.
  fix: Start the server (e.g. run `xinference-local` in a terminal) and verify that the `base_url` passed to `Client` matches the server's address and port.
- ValueError: Model 'llama-2-chat' not found on the Xinference server.
  cause: The specified model has not been launched on the Xinference server.
  fix: Use `client.list_models()` to see which models are running. If the desired model is not listed, launch it with `client.launch_model(model_name='your-model', model_type='LLM')` (or 'embedding', 'rerank', etc.) before using it for inference.
- AttributeError: 'Client' object has no attribute 'chat'
  cause: The RESTful `Client` does not expose chat methods itself; inference runs on a model handle.
  fix: Fetch the handle with `model = client.get_model(model_uid)` and call `model.chat(...)`. If you want the OpenAI-style `client.chat.completions.create(...)` interface, point the `openai` SDK at the server's OpenAI-compatible endpoint (`<base_url>/v1`) instead.
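The last two entries both come down to checking `client.list_models()` before running inference. That check can be factored into a small guard; a minimal sketch (the helper name `ensure_model_available` is illustrative, not part of the library):

```python
def ensure_model_available(available_models: dict, model_name: str) -> None:
    """Raise a helpful ValueError when model_name has not been launched.

    `available_models` is the dict returned by `client.list_models()`,
    keyed by model UID.
    """
    if model_name not in available_models:
        raise ValueError(
            f"Model '{model_name}' not found on the Xinference server. "
            f"Launch it first with client.launch_model(), or choose from: "
            f"{sorted(available_models)}"
        )
```

Call it right after constructing the client, e.g. `ensure_model_available(client.list_models(), "llama-2-chat")`, so a missing model fails fast with an actionable message instead of a mid-request error.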
Warnings
- breaking The chat interface changed across Xinference releases: older versions of `xinference-client` used `model.chat(prompt, chat_history=..., generate_config=...)`, while newer releases accept an OpenAI-style `messages` list, `model.chat(messages=..., generate_config=...)`. Keep the client and server versions aligned when upgrading.
- gotcha Xinference models must be explicitly launched on the Xinference server before they can be used by the client. Simply having the server running is not enough; models need to be loaded into memory.
- gotcha The Xinference client connects to a running Xinference server via HTTP. If the server is not running or the `base_url` provided to the `Client` constructor is incorrect, all client operations will fail with connection errors.
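Connection failures of this kind can be caught early with a tiny reachability probe before constructing the `Client`. A minimal sketch using only the standard library (`server_is_up` is an illustrative helper; `/v1/models` is served by a running Xinference server):

```python
from urllib.error import URLError
from urllib.request import urlopen


def server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if a Xinference server answers at base_url.

    A 200 from /v1/models means the base_url is usable by the client;
    connection refused, DNS failure, or timeout all return False.
    """
    try:
        with urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError, ValueError):
        return False
```

Used as `if not server_is_up(url): raise SystemExit("Start the server with xinference-local")`, this turns an opaque urllib3 traceback into a one-line, actionable failure.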
Install
- `pip install xinference-client`
Imports
- Client
from xinference.client import Client
Quickstart
```python
import os

from xinference.client import Client

# Ensure a Xinference server is running (e.g. via the `xinference-local`
# command) and that an LLM such as 'llama-2-chat' has been launched on it.
XINFERENCE_SERVER_URL = os.environ.get("XINFERENCE_SERVER_URL", "http://127.0.0.1:9997")
# Use a model name known to be available on your Xinference server.
LLM_MODEL_NAME = os.environ.get("XINFERENCE_LLM_MODEL", "llama-2-chat")

try:
    client = Client(base_url=XINFERENCE_SERVER_URL)

    # Verify connection and model availability.
    # list_models() returns a dict keyed by model UID.
    available_models = client.list_models()
    if LLM_MODEL_NAME not in available_models:
        raise ValueError(
            f"Model '{LLM_MODEL_NAME}' not found on the Xinference server. "
            "Launch it first with `client.launch_model()`, or pick one of: "
            f"{list(available_models)}"
        )

    print(f"Successfully connected to Xinference server at {XINFERENCE_SERVER_URL}")
    print(f"Using model: {LLM_MODEL_NAME}")

    # Chat with the launched model via its handle.
    model = client.get_model(LLM_MODEL_NAME)
    response = model.chat(
        messages=[{"role": "user", "content": "Hello, what is Xinference and what can it do?"}],
        generate_config={"max_tokens": 100},
    )
    print("\nChat Completion Response:")
    print(response["choices"][0]["message"]["content"])
except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure your Xinference server is running, reachable, and the LLM is launched.")
```
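Because a running Xinference server also serves OpenAI-compatible routes under `/v1`, the `openai` SDK can talk to it directly, which is where the `client.chat.completions.create` call shape comes from. A hedged sketch, assuming `pip install openai` and a launched model (`make_openai_client` is an illustrative helper, not library API):

```python
def make_openai_client(base_url: str = "http://127.0.0.1:9997/v1"):
    """Build an OpenAI SDK client pointed at a local Xinference server.

    Xinference ignores the API key, but the SDK requires one to be set.
    """
    from openai import OpenAI  # imported lazily so this snippet loads without the SDK
    return OpenAI(base_url=base_url, api_key="not-used")


# Usage (requires a running server and a launched model):
# client = make_openai_client()
# resp = client.chat.completions.create(
#     model="llama-2-chat",
#     messages=[{"role": "user", "content": "Hello!"}],
# )
# print(resp.choices[0].message.content)
```

This route is convenient when existing code is already written against the OpenAI SDK; the xinference client remains the tool for management operations such as `launch_model` and `list_models`.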