Triton Inference Server Python Client


The `tritonclient` library provides Python APIs for interacting with NVIDIA Triton Inference Server over both HTTP/REST and gRPC. Applications can send inference requests, check server and model readiness, load and unload models, and query model metadata and statistics. Currently at version 2.67.0 (released March 27, 2026), it is actively maintained, with a release cadence that generally tracks the broader Triton Inference Server project.

Warnings

Install
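
The package is published on PyPI; the protocol-specific dependencies are selected via extras. A typical install (check the PyPI page for the extras available in your version):

```shell
# Installs the HTTP and gRPC clients plus shared utilities.
pip install "tritonclient[all]"

# Or pick a single protocol:
pip install "tritonclient[http]"
pip install "tritonclient[grpc]"
```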

Imports
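
The HTTP/REST and gRPC clients live in sibling modules with near-identical APIs. A minimal import block (the try/except is only there so the snippet degrades gracefully when the library is not installed; drop it in real code):

```python
import numpy as np  # input/output tensors are exchanged as NumPy arrays

try:
    import tritonclient.http as httpclient   # HTTP/REST client
    import tritonclient.grpc as grpcclient   # gRPC client
    # Shared helpers, e.g. the exception type raised on server-side errors.
    from tritonclient.utils import InferenceServerException
except ImportError:  # tritonclient not installed in this environment
    httpclient = grpcclient = InferenceServerException = None
```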

Quickstart

This quickstart demonstrates how to initialize an HTTP client, check server readiness, prepare input tensors using NumPy, send an inference request to a hypothetical 'simple_model', and process the returned output. Remember to replace `TRITON_SERVER_URL`, `MODEL_NAME`, `MODEL_VERSION`, `INPUT_NAME`, and `OUTPUT_NAME` with your actual server and model details. For gRPC, import `tritonclient.grpc` instead and use `tritonclient.grpc.InferenceServerClient`.

import os

import numpy as np
import tritonclient.http as tritonhttp

TRITON_SERVER_URL = os.environ.get('TRITON_SERVER_URL', 'localhost:8000')
MODEL_NAME = 'simple_model'
MODEL_VERSION = '1'
INPUT_NAME = 'input_0'
OUTPUT_NAME = 'output_0'

def main():
    try:
        # Create a Triton HTTP client
        client = tritonhttp.InferenceServerClient(url=TRITON_SERVER_URL)

        # Check server readiness
        if not client.is_server_ready():
            print(f"Triton server at {TRITON_SERVER_URL} is not ready.")
            return
        print(f"Triton server at {TRITON_SERVER_URL} is ready.")

        # Prepare input data (e.g., a simple numpy array)
        input_data = np.random.rand(1, 16).astype(np.float32)
        
        # Create InferInput object
        infer_input = tritonhttp.InferInput(INPUT_NAME, list(input_data.shape), 'FP32')
        infer_input.set_data_from_numpy(input_data, binary_data=True)

        # Create InferRequestedOutput object
        infer_output = tritonhttp.InferRequestedOutput(OUTPUT_NAME, binary_data=True)

        # Send inference request
        response = client.infer(
            model_name=MODEL_NAME,
            inputs=[infer_input],
            outputs=[infer_output],
            model_version=MODEL_VERSION
        )

        # Get output as numpy array
        output_data = response.as_numpy(OUTPUT_NAME)
        print(f"Inference successful! Output shape: {output_data.shape}")
        print(f"First 5 output values: {output_data.flatten()[:5]}")

    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == '__main__':
    main()
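
As noted above, the gRPC client mirrors the HTTP surface. A hedged sketch of the same request over gRPC, assuming a server on Triton's default gRPC port 8001 and the same hypothetical model and tensor names; note that the gRPC `set_data_from_numpy` takes no `binary_data` flag, since gRPC payloads are already binary:

```python
import numpy as np


def grpc_infer(url='localhost:8001', model_name='simple_model'):
    # Imported lazily so this sketch can sit beside HTTP-only setups.
    import tritonclient.grpc as grpcclient

    client = grpcclient.InferenceServerClient(url=url)
    if not client.is_server_ready():
        raise RuntimeError(f"Triton server at {url} is not ready.")

    # Same hypothetical model as the quickstart: 1x16 FP32 in, one output.
    data = np.random.rand(1, 16).astype(np.float32)
    infer_input = grpcclient.InferInput('input_0', list(data.shape), 'FP32')
    infer_input.set_data_from_numpy(data)  # no binary_data flag for gRPC

    infer_output = grpcclient.InferRequestedOutput('output_0')
    response = client.infer(model_name=model_name,
                            inputs=[infer_input],
                            outputs=[infer_output],
                            model_version='1')
    return response.as_numpy('output_0')
```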
