Azure AI Evaluation SDK for Python
The Azure AI Evaluation SDK for Python provides tools to quantitatively measure the performance of generative AI applications. It offers built-in and custom evaluators spanning NLP metrics (such as BLEU), AI-assisted quality metrics, and safety metrics, giving comprehensive insight into an application's capabilities and limitations. The library is actively developed as part of the broader Azure SDK for Python, with a regular release cadence of bug fixes and new features.
Warnings
- breaking The environment variable `PF_EVALS_BATCH_USE_ASYNC` was renamed to `AI_EVALS_BATCH_USE_ASYNC`. Input requirements for `RetrievalEvaluator`, `RelevanceEvaluator`, and `FluencyEvaluator` have also changed.
- breaking Breaking changes in the OpenAI Python package (e.g., the removal of `eval_string_check_grader` in v1.78.0) can cause compatibility issues and silent failures (evaluations returning zero scores) in the SDK's Azure OpenAI graders such as `AzureOpenAIPythonGrader`.
- gotcha Evaluations can get stuck in 'Starting' or 'Running' state due to insufficient Azure OpenAI model capacity/quota, misconfigured authentication/permissions (e.g., missing 'Azure AI User' role for `DefaultAzureCredential`), incorrect dataset/mapping, or hitting rate limits.
- gotcha Embedding evaluation configuration directly within evaluation scripts can lead to 'configuration drift,' where different parts of the system measure metrics inconsistently, making historical comparisons unreliable.
- deprecated The `[remote]` installation extra has been removed; it is no longer needed when tracking results in Azure AI Studio.
- breaking A Jinja2 server-side template injection (SSTI, CWE-1336) vulnerability was fixed by replacing `jinja2.Template` with `jinja2.sandbox.SandboxedEnvironment` across all template rendering paths; custom templates that rely on unsafe Jinja2 features may no longer render.
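Given the silent zero-score failures described above when an incompatible openai release is installed, a fail-fast guard can help. This is a minimal sketch, assuming the v1.78.0 boundary from the warning above (verify the exact version against the SDK's release notes; `grader_compatible` and `assert_grader_support` are illustrative helper names, not SDK API):

```python
from importlib import metadata


def _parse(version: str) -> tuple:
    # Keep only the leading numeric components: "1.78.0" -> (1, 78, 0).
    parts = []
    for piece in version.split("."):
        if piece.isdigit():
            parts.append(int(piece))
        else:
            break
    return tuple(parts)


def grader_compatible(openai_version: str, broken_since: str = "1.78.0") -> bool:
    # Versions at or above `broken_since` removed grader models such as
    # `eval_string_check_grader`, which the warning above links to silent
    # zero-score failures in the SDK's Azure OpenAI graders.
    return _parse(openai_version) < _parse(broken_since)


def assert_grader_support() -> None:
    # Raise early instead of letting graders silently return zero scores.
    try:
        installed = metadata.version("openai")
    except metadata.PackageNotFoundError:
        return  # openai not installed; nothing to check
    if not grader_compatible(installed):
        raise RuntimeError(
            f"openai {installed} may be incompatible with AzureOpenAI* graders; "
            "pin an earlier version or verify grader output is non-zero."
        )
```

Pinning at install time (e.g. `pip install "openai<1.78.0"`) is an alternative; treat the boundary version as an assumption to confirm.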
Install
pip install azure-ai-evaluation
Imports
- evaluate
from azure.ai.evaluation import evaluate
- RelevanceEvaluator
from azure.ai.evaluation import RelevanceEvaluator
- BleuScoreEvaluator
from azure.ai.evaluation import BleuScoreEvaluator
- ViolenceEvaluator
from azure.ai.evaluation import ViolenceEvaluator
Quickstart
import os
from azure.ai.evaluation import evaluate, RelevanceEvaluator
# Ensure environment variables are set for Azure OpenAI
# AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_KEY, AZURE_OPENAI_DEPLOYMENT
model_config = {
"azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT", ""),
"api_key": os.environ.get("AZURE_OPENAI_KEY", ""),
"azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT", ""),
}
# Example for a simple AI-assisted quality evaluation
relevance_evaluator = RelevanceEvaluator(model_config=model_config)
# For a single-turn (query/response) evaluation
# result = relevance_evaluator(
# query="What is the capital of Japan?",
# response="Tokyo is the capital of Japan."
# )
# For evaluating a dataset
data_for_evaluation = [
{"id": "1", "query": "What is the capital of France?", "response": "Paris.", "context": "France is a country in Europe. Its capital is Paris."},
{"id": "2", "query": "Who painted the Mona Lisa?", "response": "Leonardo da Vinci.", "context": "Leonardo da Vinci was an Italian polymath."}
]
# Batch evaluation uses the `evaluate` function, which reads a JSONL dataset
# file (one JSON object per line) and takes evaluators as a name-to-callable
# mapping. Write `data_for_evaluation` to a JSONL file first.
# Ensure you have a configured Azure AI project if logging results to AI Studio.
# azure_ai_project = {
#     "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID", ""),
#     "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP", ""),
#     "project_name": os.environ.get("AZURE_AI_PROJECT_NAME", ""),
# }
# results = evaluate(
#     data="evaluation_data.jsonl",  # path to the JSONL dataset
#     evaluators={"relevance": relevance_evaluator},
#     # azure_ai_project=azure_ai_project,  # uncomment to log to AI Studio
# )
print("Evaluators initialized. Ready for evaluation.")
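Since `evaluate` reads its dataset from a JSONL file, the in-memory records above need to be persisted first. A minimal sketch of writing the records and invoking `evaluate` (the file name and the `"relevance"` evaluator key are illustrative choices; the `evaluate` call itself is left commented because it requires live Azure OpenAI access):

```python
import json
import os
import tempfile

data_for_evaluation = [
    {"id": "1", "query": "What is the capital of France?", "response": "Paris.",
     "context": "France is a country in Europe. Its capital is Paris."},
    {"id": "2", "query": "Who painted the Mona Lisa?", "response": "Leonardo da Vinci.",
     "context": "Leonardo da Vinci was an Italian polymath."},
]

# Write one JSON object per line, the format `evaluate(data=...)` expects.
dataset_path = os.path.join(tempfile.mkdtemp(), "evaluation_data.jsonl")
with open(dataset_path, "w", encoding="utf-8") as f:
    for record in data_for_evaluation:
        f.write(json.dumps(record) + "\n")

# from azure.ai.evaluation import evaluate
# results = evaluate(
#     data=dataset_path,
#     evaluators={"relevance": relevance_evaluator},  # name -> callable mapping
# )
```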