{"id":7465,"library":"onnxruntime-genai","title":"ONNX Runtime GenAI","description":"ONNX Runtime GenAI is a Python library that provides an easy, flexible, and performant way to run Generative AI models (Large Language Models and multi-modal models) on-device and in the cloud using ONNX Runtime. It encapsulates the complete generative AI loop, including pre- and post-processing, inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. The library is actively developed, with version 0.13.1 released in April 2026, generally following a quarterly release cadence in line with the broader ONNX Runtime project.","status":"active","version":"0.13.1","language":"en","source_language":"en","source_url":"https://github.com/microsoft/onnxruntime-genai","tags":["AI","LLM","ONNX","inference","deep learning","genai","on-device","multi-modal"],"install":[{"cmd":"pip install onnxruntime-genai","lang":"bash","label":"For CPU"},{"cmd":"pip install onnxruntime-genai-directml","lang":"bash","label":"For DirectML (Windows)"},{"cmd":"pip install onnxruntime-genai-cuda","lang":"bash","label":"For CUDA 12 (Requires CUDA Toolkit and CUDA_PATH env var)"}],"dependencies":[{"reason":"Core runtime dependency; separated from onnxruntime-genai since version 0.4.0.","package":"onnxruntime","optional":false},{"reason":"Required for array manipulation, especially for model inputs/outputs.","package":"numpy","optional":false},{"reason":"Often used for downloading ONNX models via `huggingface-cli` for local inference.","package":"huggingface_hub","optional":true}],"imports":[{"symbol":"onnxruntime_genai","correct":"import onnxruntime_genai as og"}],"quickstart":{"code":"import os\nimport onnxruntime_genai as og\n\n# --- Prerequisite: Download a model ---\n# The following shell command downloads the Phi-3 Mini 4K Instruct ONNX model (CPU-INT4 quantized).\n# You will need to install huggingface_hub: pip install huggingface_hub\n# Run this command in your terminal 
before executing the Python code:\n# huggingface-cli download microsoft/Phi-3-mini-4k-instruct-onnx \\\n#   --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* \\\n#   --local-dir ./phi-3-mini-onnx\n\nmodel_path = os.environ.get('ONNX_MODEL_PATH', './phi-3-mini-onnx')\n\ntry:\n    # 1. Load the model\n    model = og.Model(model_path)\n    print(f\"Loaded {model.type} on {model.device_type}\")\n\n    # 2. Create a tokenizer and a streaming detokenizer\n    tokenizer = og.Tokenizer(model)\n    tokenizer_stream = tokenizer.create_stream()\n\n    # 3. Create generator parameters\n    #    (do_sample=True is required for top_p/temperature to take effect;\n    #    without it, greedy search is used and both options are ignored)\n    params = og.GeneratorParams(model)\n    params.set_search_options(do_sample=True, max_length=200, top_p=0.9, temperature=0.7)\n\n    # 4. Encode the initial prompt\n    prompt = \"The capital of France is\"\n    input_tokens = tokenizer.encode(prompt)\n\n    # 5. Create a generator instance and append the prompt tokens\n    generator = og.Generator(model, params)\n    generator.append_tokens(input_tokens)\n\n    print(f\"Prompt: {prompt}\")\n    print(\"Generated text:\", end=\"\")\n\n    # 6. Generate tokens one by one; the tokenizer stream decodes them\n    #    incrementally, correctly handling characters that span multiple tokens\n    while not generator.is_done():\n        generator.generate_next_token()\n        last_token = generator.get_sequence(0)[-1]\n        print(tokenizer_stream.decode(last_token), end=\"\", flush=True)\n    print()\n\n    # Get the full decoded sequence (optional, for non-streaming output)\n    # output = tokenizer.decode(generator.get_sequence(0))\n    # print(f\"\\nFull output: {output}\")\n\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\n    print(f\"Please ensure the model is downloaded to '{model_path}' and all dependencies are installed.\")","lang":"python","description":"This quickstart demonstrates how to load a pre-optimized ONNX model (such as Phi-3 Mini), tokenize an input prompt, and generate text with the `onnxruntime-genai` library. Before running the Python code, you must download an ONNX model, typically with `huggingface-cli`, into a local directory. 
The example reads the model path from the `ONNX_MODEL_PATH` environment variable for flexibility."},"warnings":[{"fix":"Replace `params.input_ids = input_tokens` with `generator.append_tokens(input_tokens)` after the generator object is created. Remove calls to `generator.compute_logits()`. For multi-turn conversations, create a loop and call `generator.append_tokens(new_prompt_tokens)` for each turn.","message":"Version 0.6.0 introduced a breaking API change to support 'chat mode' (continuation/continuous decoding). The `GeneratorParams.input_ids` attribute and the `generator.compute_logits()` method were removed; prompt tokens are now supplied via `generator.append_tokens()`.","severity":"breaking","affected_versions":"<=0.5.2"},{"fix":"Upgrade `onnxruntime-genai` to 0.5.0 or later, or downgrade `transformers` to a version lower than 4.45.0.","message":"ONNX Runtime GenAI versions 0.4.0 and earlier were incompatible with `transformers` version 4.45.0 and later when using the Model Builder tool, leading to `RuntimeError: [json.exception.type_error.302]` if `tokenizer_config.json` contained an array for the `model_input_names` field.","severity":"gotcha","affected_versions":"0.1.0 - 0.4.0"},{"fix":"Use Python versions 3.10, 3.11, or 3.12 until official support for 3.13 is released.","message":"Pre-built wheels for `onnxruntime-genai` currently do not support Python 3.13. 
Attempting to install may result in `ERROR: No matching distribution found`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For stability, use examples from the corresponding version tag (e.g., the `v0.13.1` tag) if using pre-built binaries, or build `onnxruntime-genai` from source if using examples directly from the `main` branch.","message":"Examples in the `main` branch of the GitHub repository may not be compatible with the latest stable PyPI release binaries due to ongoing development.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If upgrading to Ryzen AI 1.7, download the updated, compatible models.","message":"Models from earlier Ryzen AI releases are not compatible with Ryzen AI 1.7 (which ships OGA v0.11.2, up from the v0.9.2.2 used previously).","severity":"breaking","affected_versions":"Ryzen AI 1.6.1 and earlier models"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Ensure `pip install onnxruntime-genai` was run successfully in the correct virtual environment, or switch to the appropriate Python kernel in your IDE/notebook.","cause":"The package `onnxruntime-genai` is not installed in the active Python environment, or the environment is not correctly activated (e.g., a Jupyter notebook using the wrong kernel).","error":"ModuleNotFoundError: No module named 'onnxruntime_genai'"},{"fix":"In your Conda environment, run: `conda install conda-forge::vs2015_runtime`.","cause":"This usually occurs in a Conda environment on Windows due to an outdated Visual Studio C++ runtime.","error":"ImportError: DLL load failed while importing onnxruntime_genai: A dynamic link library (DLL) initialization routine failed."},{"fix":"Ensure the `CUDA_PATH` system environment variable is set to the installation directory of your CUDA Toolkit.","cause":"On Windows with CUDA, this error often means the `CUDA_PATH` environment variable is not correctly set after CUDA Toolkit 
installation.","error":"DLL load failed while importing onnxruntime_genai"},{"fix":"Use Python 3.10, 3.11, or 3.12, which have supported pre-built distributions.","cause":"The currently used Python version (e.g., Python 3.13) does not have pre-built wheels available for `onnxruntime-genai` on PyPI.","error":"ERROR: No matching distribution found for onnxruntime-genai"},{"fix":"Upgrade `onnxruntime-genai` to version 0.5.0 or newer, or downgrade `transformers` to a version prior to 4.45.0.","cause":"Incompatibility between `onnxruntime-genai` (versions <= 0.4.0) and `HuggingFace transformers` (versions >= 4.45.0) when `tokenizer_config.json` uses an array for `model_input_names`.","error":"RuntimeError: [json.exception.type_error.302] type must be string, but is array."}]}