ONNX Runtime (GPU)
ONNX Runtime is a high-performance inference engine for ONNX models. The `onnxruntime-gpu` package adds GPU acceleration (e.g., via CUDA or ROCm) on top of the core ONNX Runtime. It is actively developed by Microsoft, with frequent releases that typically track new ONNX operator sets and performance improvements; the current version is 1.24.4.
Warnings
- gotcha The `onnxruntime-gpu` package requires a specific CUDA Toolkit and cuDNN version to be installed on your system. Mismatched versions are a very common cause of `InferenceSession` initialization failures or runtime errors.
- gotcha When using `onnxruntime-gpu`, you must explicitly specify execution providers like `['CUDAExecutionProvider', 'CPUExecutionProvider']` during `InferenceSession` creation to ensure GPU acceleration is attempted. If not specified, ONNX Runtime might default to CPU execution even with the GPU package installed.
- gotcha There are two main PyPI packages: `onnxruntime` (CPU-only) and `onnxruntime-gpu` (GPU-enabled). Installing `onnxruntime-gpu` does *not* automatically remove `onnxruntime`. If both are installed, `onnxruntime` might be used by default or cause conflicts, leading to unexpected CPU-only execution.
- breaking Starting with ONNX Runtime version 1.17, official support for Python 3.8 and 3.9 was dropped. Version 1.24.0 and later also dropped support for Python 3.10. The current version (1.24.4) explicitly requires Python >= 3.11.
Install
- pip install onnxruntime-gpu
Imports
- InferenceSession
import onnxruntime as ort
session = ort.InferenceSession(...)
Quickstart
import onnxruntime as ort
import numpy as np
import onnx
from onnx import helper, TensorProto
import os
# 1. Create a dummy ONNX model for demonstration
# Define the graph (input, output, and node)
X = helper.make_tensor_value_info('X', TensorProto.FLOAT, [None, 3])
Y = helper.make_tensor_value_info('Y', TensorProto.FLOAT, [None, 3])
node = helper.make_node('Relu', ['X'], ['Y'])
graph = helper.make_graph([node], 'simple_relu', [X], [Y])
model = helper.make_model(graph, producer_name='onnx-example')
# Save it to a temporary file
model_path = "simple_relu.onnx"
onnx.save(model, model_path)
# 2. Load the model with GPU provider
try:
    # Prioritize CUDAExecutionProvider for NVIDIA GPUs;
    # fall back to CPUExecutionProvider if CUDA is not available or fails
    session = ort.InferenceSession(
        model_path,
        providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
    )
    print("ONNX Runtime session created with providers:", session.get_providers())

    # Prepare dummy input data
    input_data = np.random.rand(1, 3).astype(np.float32)

    # Run inference
    output = session.run(None, {'X': input_data})
    print("Inference successful. Output shape:", output[0].shape)
except Exception as e:
    print(f"\nError creating ONNX Runtime session or running inference: {e}")
    print("Make sure you have a compatible CUDA environment (or other GPU runtime)")
    print("and the correct onnxruntime-gpu package installed.\n")
    print("If CUDA is not available, try removing 'CUDAExecutionProvider' from the providers list.")
finally:
    # Clean up the dummy model file
    if os.path.exists(model_path):
        os.remove(model_path)