Arize Phoenix Evals

3.0.0 · active · verified Mon Apr 13

Phoenix Evals provides lightweight, composable building blocks for writing and running evaluations on LLM applications. It offers tools for evaluating relevance, toxicity, hallucination, and more. The library is actively developed; version 3.0.0 is the current release, and it receives frequent updates as part of the broader Arize Phoenix ecosystem.

Install
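
The library used in the quickstart is distributed on PyPI as `arize-phoenix-evals`, so a typical install looks like:

```shell
pip install arize-phoenix-evals
```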

Quickstart

This quickstart demonstrates how to set up an LLM-based classification evaluator using the `arize-phoenix-evals` library with an OpenAI model. It covers defining an evaluator with a prompt template and performing evaluations on both simple and nested input data, showcasing input mapping.

import os
from phoenix.evals import create_classifier
from phoenix.evals.llm import LLM

# Make sure your OpenAI API key is available before constructing the LLM,
# e.g. export OPENAI_API_KEY=sk-...
if "OPENAI_API_KEY" not in os.environ:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable")

# Create an LLM instance for the OpenAI provider
llm = LLM(provider="openai", model="gpt-4o")

# Create a custom classification evaluator
evaluator = create_classifier(
    name="helpfulness",
    prompt_template="Rate the response to the user query as helpful or not:\n\nQuery: {input}\nResponse: {output}",
    llm=llm,
    choices={"helpful": 1.0, "not_helpful": 0.0},
)

# Simple evaluation on a single record
scores = evaluator.evaluate({"input": "How do I reset the device?", "output": "Go to settings > reset."})
print(f"Simple evaluation score: {scores[0].score}, label: {scores[0].label}")

# Evaluation with input mapping for nested data
scores_nested = evaluator.evaluate(
    {"data": {"query": "How do I restart the app?", "response": "Close and reopen the application."}},
    input_mapping={"input": "data.query", "output": "data.response"}
)
print(f"Nested evaluation score: {scores_nested[0].score}, label: {scores_nested[0].label}")
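
The `input_mapping` argument pulls fields out of nested payloads by dotted path, flattening them into the names the prompt template expects. As a rough illustration of that idea (not the library's actual implementation; `resolve_path` is a hypothetical helper), the lookup can be sketched in plain Python as:

```python
def resolve_path(record: dict, path: str):
    """Walk a dotted path like 'data.query' through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value

record = {
    "data": {
        "query": "How do I restart the app?",
        "response": "Close and reopen the application.",
    }
}
mapping = {"input": "data.query", "output": "data.response"}

# Build the flat payload that a prompt template with {input}/{output}
# placeholders would be formatted with
payload = {field: resolve_path(record, path) for field, path in mapping.items()}
print(payload)
# {'input': 'How do I restart the app?', 'output': 'Close and reopen the application.'}
```

The sketch only handles nested dict keys; the library's mapping support may cover more cases, but the core idea is the same dotted-path flattening.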
