{"library":"tritonclient","title":"Triton Inference Server Python Client","description":"The `tritonclient` library provides Python APIs for interacting with NVIDIA Triton Inference Server. It supports both HTTP/REST and gRPC protocols, allowing applications to send inference requests, retrieve server and model status, manage models, and perform other tasks. Currently at version 2.67.0 (released March 27, 2026), it is actively maintained with a release cadence that generally aligns with the broader Triton Inference Server project.","status":"active","version":"2.67.0","language":"en","source_language":"en","source_url":"https://github.com/triton-inference-server/client","tags":["inference","deep learning","machine learning","NVIDIA","client","gRPC","HTTP","AI"],"install":[{"cmd":"pip install tritonclient","lang":"bash","label":"Base installation"},{"cmd":"pip install tritonclient[all]","lang":"bash","label":"With all optional dependencies (e.g., CUDA shared memory)"}],"dependencies":[{"reason":"Required for creating and handling input/output tensors.","package":"numpy","optional":false},{"reason":"Required for gRPC client functionality.","package":"grpcio","optional":true},{"reason":"Required for gRPC client functionality (often installed with grpcio).","package":"grpcio-tools","optional":true},{"reason":"Used for asynchronous HTTP client operations.","package":"gevent","optional":true},{"reason":"Required for `cuda_shared_memory` utilities.","package":"cupy","optional":true}],"imports":[{"symbol":"InferenceServerClient","correct":"from tritonclient.http import InferenceServerClient"},{"symbol":"InferenceServerClient","correct":"from tritonclient.grpc import InferenceServerClient"},{"note":"InferInput is specific to the HTTP or gRPC client, depending on which one you're using. Import it from the matching submodule (http or grpc).","wrong":"from tritonclient.grpc import InferInput","symbol":"InferInput","correct":"from tritonclient.http import InferInput"},{"note":"InferRequestedOutput is specific to the HTTP or gRPC client, depending on which one you're using. Import it from the matching submodule (http or grpc).","wrong":"from tritonclient.http import InferRequestedOutput","symbol":"InferRequestedOutput","correct":"from tritonclient.grpc import InferRequestedOutput"},{"symbol":"InferenceServerException","correct":"from tritonclient.utils import InferenceServerException"},{"symbol":"ProtocolType","correct":"from tritonclient.utils import ProtocolType"}],"quickstart":{"code":"import numpy as np\nimport tritonclient.http as tritonhttp\nimport os\n\nTRITON_SERVER_URL = os.environ.get('TRITON_SERVER_URL', 'localhost:8000')\nMODEL_NAME = 'simple_model'\nMODEL_VERSION = '1'\nINPUT_NAME = 'input_0'\nOUTPUT_NAME = 'output_0'\n\ndef main():\n    try:\n        # Create a Triton HTTP client\n        client = tritonhttp.InferenceServerClient(url=TRITON_SERVER_URL)\n\n        # Check server readiness\n        if not client.is_server_ready():\n            print(f\"Triton server at {TRITON_SERVER_URL} is not ready.\")\n            return\n        print(f\"Triton server at {TRITON_SERVER_URL} is ready.\")\n\n        # Prepare input data (e.g., a simple numpy array)\n        input_data = np.random.rand(1, 16).astype(np.float32)\n\n        # Create InferInput object\n        infer_input = tritonhttp.InferInput(INPUT_NAME, list(input_data.shape), 'FP32')\n        infer_input.set_data_from_numpy(input_data, binary_data=True)\n\n        # Create InferRequestedOutput object\n        infer_output = tritonhttp.InferRequestedOutput(OUTPUT_NAME, binary_data=True)\n\n        # Send inference request\n        response = client.infer(\n            model_name=MODEL_NAME,\n            inputs=[infer_input],\n            outputs=[infer_output],\n            model_version=MODEL_VERSION\n        )\n\n        # Get output as numpy array\n        output_data = response.as_numpy(OUTPUT_NAME)\n        print(f\"Inference successful! Output shape: {output_data.shape}\")\n        print(f\"First 5 output values: {output_data.flatten()[:5]}\")\n\n    except Exception as e:\n        print(f\"An error occurred: {e}\")\n\nif __name__ == '__main__':\n    main()","lang":"python","description":"This quickstart demonstrates how to initialize an HTTP client, check server readiness, prepare input tensors using NumPy, send an inference request to a hypothetical 'simple_model', and process the returned output. Replace `TRITON_SERVER_URL`, `MODEL_NAME`, `MODEL_VERSION`, `INPUT_NAME`, and `OUTPUT_NAME` with your actual server and model details. For gRPC, import `tritonclient.grpc` instead and use `tritonclient.grpc.InferenceServerClient`."},"warnings":[{"fix":"Explicitly import `InferInput` and `InferRequestedOutput` from `tritonclient.http` or `tritonclient.grpc`, as appropriate for your client instance.","message":"When using `InferInput` or `InferRequestedOutput`, ensure you import them from the protocol submodule (`tritonclient.http` or `tritonclient.grpc`) that matches the client you are using. Mixing them will lead to errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"When creating NumPy arrays for BYTES tensors, set the dtype to `np.object_`.","message":"For BYTES tensors (variable-length binary data/strings), use `numpy.object_` as the dtype of the NumPy array. While `numpy.bytes_` is supported for backward compatibility, `numpy.object_` is the preferred and more robust type.","severity":"gotcha","affected_versions":"All versions; especially relevant in recent releases"},{"fix":"Be aware of these limitations. For gRPC, ensure your server is responsive to avoid indefinite waits. For HTTP, consider increasing timeout values if you experience unexpected delays with short timeouts.","message":"The gRPC client (`tritonclient.grpc.InferenceServerClient`) has known limitations: it does not support timeouts for model configuration and model metadata requests. The HTTP client may also not correctly respect timeouts under 1 second.","severity":"gotcha","affected_versions":"All recent versions"},{"fix":"If shared memory is required in multithreaded Python clients, consider using system shared memory (`tritonclient.utils.shared_memory`) or investigate alternatives until the CUDA shared memory issue is resolved in CuPy.","message":"Avoid using `tritonclient.utils.cuda_shared_memory` APIs in multithreaded environments. Known issues in the underlying CuPy library can cause instability until they are fixed upstream.","severity":"gotcha","affected_versions":"All recent versions (e.g., Triton Inference Server 26.01 and earlier)"},{"fix":"If your application relies on response order for decoupled models, handle responses asynchronously and correlate them using request IDs or other unique identifiers.","message":"When communicating with decoupled models (models that can return multiple responses over time), the order of responses received by the streaming gRPC client may not always match the order in which they were sent by the backend for different requests.","severity":"gotcha","affected_versions":"All recent versions"},{"fix":"For ARM SBSA, obtain the correct client wheel file directly from the ARM SBSA SDK image and install it manually, rather than relying on `pip install tritonclient`.","message":"Triton Client PIP wheels for ARM SBSA are not available on PyPI. Installing `tritonclient` via `pip` on ARM SBSA systems may result in an incorrect Jetson version of the library being installed, leading to compatibility issues.","severity":"gotcha","affected_versions":"All recent versions"}],"env_vars":null,"last_verified":"2026-04-05T00:00:00.000Z","next_check":"2026-07-04T00:00:00.000Z"}