Xinference Client

2.5.0 · active · verified Thu Apr 16

Xinference-client is the official Python client library for interacting with a Xinference server. It allows users to manage and deploy various AI models (LLMs, embedding models, speech models, etc.) and perform inference. The library is actively maintained with frequent minor releases, often bi-weekly or monthly, reflecting updates in the broader Xinference ecosystem.

Install

pip install xinference-client

Imports

from xinference.client import Client
Quickstart

This quickstart demonstrates how to connect to a running Xinference server, verify model availability, and perform a chat completion using the `Client` object. It assumes a Xinference server is running and a Large Language Model (LLM) is already launched on it.

import os
from xinference.client import Client

# Ensure a Xinference server is running, e.g., using 'xinference-local' command.
# Ensure an LLM (e.g., 'llama-2-chat') is launched on it.
XINFERENCE_SERVER_URL = os.environ.get("XINFERENCE_SERVER_URL", "http://127.0.0.1:9997")
# Use a model name known to be available on your Xinference server
LLM_MODEL_NAME = os.environ.get("XINFERENCE_LLM_MODEL", "llama-2-chat")

try:
    client = Client(base_url=XINFERENCE_SERVER_URL)

    # Verify connection and model availability. list_models() returns a
    # mapping of model UID to model spec, so the check below assumes the
    # model was launched with its name as its UID (the default).
    available_models = client.list_models()
    if LLM_MODEL_NAME not in available_models:
        raise ValueError(
            f"Model '{LLM_MODEL_NAME}' not found on the Xinference server. "
            "Please launch it first using `client.launch_model()` or select an available model from: "
            f"{list(available_models.keys())}"
        )

    print(f"Successfully connected to Xinference server at {XINFERENCE_SERVER_URL}")
    print(f"Using model: {LLM_MODEL_NAME}")

    # Get a handle to the launched model and run a chat completion.
    # model.chat returns an OpenAI-style response dictionary.
    model = client.get_model(LLM_MODEL_NAME)
    messages = [{"role": "user", "content": "Hello, what is Xinference and what can it do?"}]
    response = model.chat(
        messages=messages,
        generate_config={"max_tokens": 100}
    )
    print("\nChat Completion Response:")
    print(response["choices"][0]["message"]["content"])

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure your Xinference server is running, accessible, and the specified LLM is launched.")
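The chat call above returns a plain dictionary in the OpenAI chat-completion shape, so the reply text lives at `choices[0].message.content`. A minimal offline sketch of extracting it (the sample payload below is illustrative, not real server output):

```python
# Illustrative response in the OpenAI chat-completion shape that
# Xinference's chat endpoint mirrors; all values here are made up.
sample_response = {
    "id": "chatcmpl-example",
    "object": "chat.completion",
    "model": "llama-2-chat",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Xinference serves and runs AI models.",
            },
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 12, "completion_tokens": 8, "total_tokens": 20},
}

def extract_reply(response: dict) -> str:
    """Pull the assistant's text out of a chat-completion response dict."""
    return response["choices"][0]["message"]["content"]

print(extract_reply(sample_response))
```

Because the schema matches OpenAI's, the same extraction works whether the response came from `model.chat` or from an OpenAI-compatible client pointed at the server's `/v1` endpoint.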
