OpenEvals

0.2.0 · active · verified Sat Apr 11

OpenEvals is an open-source Python library providing ready-made evaluators for Large Language Model (LLM) applications. It offers a structured approach to LLM evaluation, similar to traditional software testing, with built-in functionality like LLM-as-judge evaluators and prebuilt prompts for common evaluation scenarios such as correctness, conciseness, and hallucination detection. Developed by LangChain, it aims to streamline bringing LLM applications to production by making evaluation more accessible and transparent.

Warnings

Install
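OpenEvals is distributed on PyPI. A minimal install for the quickstart below would look like the following (the quickstart's `openai:`-prefixed model string also assumes the OpenAI provider package is available in your environment):

```shell
pip install openevals
```
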

Imports

Quickstart

This quickstart demonstrates how to set up and run a basic LLM-as-judge correctness evaluation. It uses a prebuilt prompt and an OpenAI model. Ensure your `OPENAI_API_KEY` environment variable is set for the example to run successfully. The evaluator returns a dictionary containing a score and a comment based on the LLM's judgment.

import os
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

# The evaluator calls the OpenAI API, so OPENAI_API_KEY must be set,
# e.g. os.environ["OPENAI_API_KEY"] = "sk-..."
if not os.environ.get("OPENAI_API_KEY"):
    print("WARNING: OPENAI_API_KEY is not set; the evaluation call below will fail.")

# Create a correctness evaluator using an LLM-as-judge
correctness_evaluator = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini",
)

# Define inputs, outputs, and reference outputs for evaluation
inputs = "How much has the price of doodads changed in the past year?"
outputs = "Doodads have increased in price by 10% in the past year."
reference_outputs = "The price of doodads has decreased by 50% in the past year."

# Run the evaluator
eval_result = correctness_evaluator(
    inputs=inputs,
    outputs=outputs,
    reference_outputs=reference_outputs
)

print(eval_result)
# Expected output (the comment's exact wording varies between runs, but the structure is consistent):
# { 'key': 'score', 'score': False, 'comment': 'The provided answer stated that doodads increased in price by 10%, which conflicts with the reference output...' }
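The result dict can also be consumed programmatically, for example to gate a test suite on the judge's verdict. A minimal sketch, assuming only the result structure shown above (the `assert_correct` helper is hypothetical, not part of OpenEvals):

```python
def assert_correct(eval_result: dict) -> None:
    """Raise if an evaluator result dict indicates failure.

    Expects the {'key', 'score', 'comment'} structure shown above.
    """
    if not eval_result["score"]:
        raise AssertionError(f"Correctness check failed: {eval_result['comment']}")

# Example with a result shaped like the quickstart output:
sample = {"key": "score", "score": True, "comment": "Matches the reference."}
assert_correct(sample)  # passes silently when the score is truthy
```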
