Triton Inference Server Python Client
The `tritonclient` library provides Python APIs for interacting with NVIDIA Triton Inference Server. It supports both HTTP/REST and gRPC protocols, allowing applications to send inference requests, retrieve server and model status, manage models, and perform other tasks. Currently at version 2.67.0 (released March 27, 2026), it is actively maintained with a release cadence that generally aligns with the broader Triton Inference Server project.
Warnings
- gotcha When using `InferInput` or `InferRequestedOutput`, ensure you import them from the correct protocol submodule (`tritonclient.http` or `tritonclient.grpc`) corresponding to the client you are using. Mixing them will lead to errors.
- gotcha For BYTES tensors (variable-length binary data/strings), it is recommended to use `numpy.object_` for the dtype of the NumPy array. While `numpy.bytes_` is supported for backward compatibility, `numpy.object_` is the preferred and more robust type.
- gotcha The gRPC client (`tritonclient.grpc.InferenceServerClient`) has known limitations where it does not support timeouts for model configuration and model metadata requests. The HTTP client may also not correctly respect timeouts under 1 second.
- gotcha Avoid using `tritonclient.utils.cuda_shared_memory` APIs in multithreaded environments: known issues in the underlying CuPy library can cause instability until they are fixed upstream.
- gotcha When communicating with decoupled models (models that can return multiple responses over time), the order of responses received by the streaming gRPC client may not always match the order in which they were sent by the backend for different requests.
- gotcha Triton Client PIP wheels for ARM SBSA are not available on PyPI. Installing `tritonclient` via `pip` on ARM SBSA systems may result in an incorrect Jetson version of the library being installed, leading to compatibility issues.
Install
- Base package
pip install tritonclient
- With all optional protocol dependencies
pip install tritonclient[all]
Imports
- InferenceServerClient (HTTP)
from tritonclient.http import InferenceServerClient
- InferenceServerClient (gRPC)
from tritonclient.grpc import InferenceServerClient
- InferInput (HTTP; also available from tritonclient.grpc)
from tritonclient.http import InferInput
- InferRequestedOutput (gRPC; also available from tritonclient.http)
from tritonclient.grpc import InferRequestedOutput
- InferenceServerException
from tritonclient.utils import InferenceServerException
- np_to_triton_dtype
from tritonclient.utils import np_to_triton_dtype
Quickstart
import os

import numpy as np
import tritonclient.http as tritonhttp

TRITON_SERVER_URL = os.environ.get('TRITON_SERVER_URL', 'localhost:8000')
MODEL_NAME = 'simple_model'
MODEL_VERSION = '1'
INPUT_NAME = 'input_0'
OUTPUT_NAME = 'output_0'

def main():
    try:
        # Create a Triton HTTP client
        client = tritonhttp.InferenceServerClient(url=TRITON_SERVER_URL)

        # Check server readiness
        if not client.is_server_ready():
            print(f"Triton server at {TRITON_SERVER_URL} is not ready.")
            return
        print(f"Triton server at {TRITON_SERVER_URL} is ready.")

        # Prepare input data (e.g., a simple numpy array)
        input_data = np.random.rand(1, 16).astype(np.float32)

        # Create InferInput object
        infer_input = tritonhttp.InferInput(INPUT_NAME, input_data.shape, 'FP32')
        infer_input.set_data_from_numpy(input_data, binary_data=True)

        # Create InferRequestedOutput object
        infer_output = tritonhttp.InferRequestedOutput(OUTPUT_NAME, binary_data=True)

        # Send inference request
        response = client.infer(
            model_name=MODEL_NAME,
            inputs=[infer_input],
            outputs=[infer_output],
            model_version=MODEL_VERSION,
        )

        # Get output as numpy array
        output_data = response.as_numpy(OUTPUT_NAME)
        print(f"Inference successful! Output shape: {output_data.shape}")
        print(f"First 5 output values: {output_data.flatten()[:5]}")
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == '__main__':
    main()