Arize Phoenix Evals
Phoenix Evals provides lightweight, composable building blocks for writing and running evaluations on LLM applications. It offers evaluators for relevance, toxicity, hallucination detection, and more. The library is actively developed, with version 3.0.0 being the current release, and features frequent updates as part of the broader Arize Phoenix ecosystem.
Warnings
- breaking Version 3.0.0 of `arize-phoenix-evals` (and Phoenix v14.0.0) deprecates and removes the 'evals 1.0' module and the legacy experiments module. The `/v1/evaluations` REST endpoint has also been removed from the Phoenix server.
- breaking The legacy `phoenix.session.client.Client` (accessed as `px.Client()`) has been removed in Phoenix v14.0.0. All client interactions now go through `arize-phoenix-client`.
- gotcha When using LLM-based evaluators, you must separately install the SDK for your chosen LLM vendor (e.g., `openai` for OpenAI models, `langchain` for LangChain integrations). `arize-phoenix-evals` does not bundle these dependencies.
- gotcha Starting with `arize-phoenix-evals` 2.12.0, evaluators automatically JSON-serialize structured data (dicts, lists) passed as template variable values. Manually `str()`-ing complex objects is no longer necessary and could lead to incorrect prompt rendering.
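The serialization gotcha above can be illustrated with a toy template renderer. This is not the library's implementation, just a sketch of the behavior described: dicts and lists are JSON-encoded before substitution, whereas a manual `str()` would produce a Python repr (single quotes) that is not valid JSON.

```python
import json

# Illustrative only: a hypothetical renderer mimicking the documented behavior.
def render(template: str, variables: dict) -> str:
    rendered = {}
    for key, value in variables.items():
        if isinstance(value, (dict, list)):
            # evaluators >= 2.12.0 JSON-serialize structured values automatically
            rendered[key] = json.dumps(value)
        else:
            rendered[key] = str(value)
    return template.format(**rendered)

context = {"docs": ["a", "b"], "meta": {"source": "kb"}}
print(render("Context: {ctx}", {"ctx": context}))
# JSON output uses double quotes; str(context) would emit a Python repr instead
```

Because the library now does this for you, pre-stringifying a dict would get it serialized a second time, yielding a malformed prompt.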
Install
- pip install arize-phoenix-evals
- pip install 'arize-phoenix-evals>=2.0.0' openai
Imports
- create_classifier
from phoenix.evals import create_classifier
- LLM
from phoenix.evals.llm import LLM
- evaluate_dataframe
from phoenix.evals import evaluate_dataframe
Quickstart
import os
from phoenix.evals import create_classifier
from phoenix.evals.llm import LLM
# Fail fast if the OpenAI API key is missing; the LLM reads it from the environment
if "OPENAI_API_KEY" not in os.environ:
    raise EnvironmentError("Set the OPENAI_API_KEY environment variable")
# Create an LLM instance
llm = LLM(provider="openai", model="gpt-4o")
# Create a custom classification evaluator
evaluator = create_classifier(
    name="helpfulness",
    prompt_template="Rate the response to the user query as helpful or not:\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"helpful": 1.0, "not_helpful": 0.0},
)
# Simple evaluation on a single record
scores = evaluator.evaluate({"input": "How do I reset the device?", "output": "Go to settings > reset."})
print(f"Simple evaluation score: {scores[0].score}, label: {scores[0].label}")
# Evaluation with input mapping for nested data
scores_nested = evaluator.evaluate(
    {"data": {"query": "How do I restart the app?", "response": "Close and reopen the application."}},
    input_mapping={"input": "data.query", "output": "data.response"},
)
print(f"Nested evaluation score: {scores_nested[0].score}, label: {scores_nested[0].label}")
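The `evaluate_dataframe` import can be used to run one or more evaluators over a whole pandas DataFrame. A hedged sketch, assuming `evaluate_dataframe(dataframe=..., evaluators=[...])` returns a copy of the frame with the evaluators' scores appended as columns (exact column names and keyword signature may differ by version); the API call is guarded so the snippet degrades gracefully when no key is set:

```python
import os

import pandas as pd

# Each row supplies the {input}/{output} template variables by column name
df = pd.DataFrame(
    {
        "input": ["How do I reset the device?", "Where is the manual?"],
        "output": ["Go to settings > reset.", "I don't know."],
    }
)

if os.environ.get("OPENAI_API_KEY"):
    from phoenix.evals import create_classifier, evaluate_dataframe
    from phoenix.evals.llm import LLM

    llm = LLM(provider="openai", model="gpt-4o")
    evaluator = create_classifier(
        name="helpfulness",
        prompt_template="Rate the response to the user query as helpful or not:\n\nQuery: {input}\nResponse: {output}",
        llm=llm,
        choices={"helpful": 1.0, "not_helpful": 0.0},
    )
    # Batch-evaluate every row; scores come back as added columns
    results = evaluate_dataframe(dataframe=df, evaluators=[evaluator])
    print(results.columns.tolist())
else:
    print("OPENAI_API_KEY not set; skipping the API call")
```

Column-name conventions in the result frame are version-dependent, so inspect `results.columns` rather than hard-coding them.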