OpenEvals
OpenEvals is an open-source Python library from LangChain that provides ready-made evaluators for Large Language Model (LLM) applications. It brings a structured, test-like approach to LLM evaluation, with built-in LLM-as-judge evaluators and prebuilt prompts for common scenarios such as correctness, conciseness, and hallucination detection. The goal is to make evaluation more accessible and transparent, streamlining the path from prototype to production. At the time of writing, the current version is 0.2.0, with ongoing development and updates.
Warnings
- gotcha The `model` parameter in `create_llm_as_judge` expects a LangChain-style `provider:model` string (e.g., `"openai:o3-mini"`). This goes through LangChain's model abstraction and differs from instantiating a provider's client directly.
- gotcha Providing a custom `output_schema` to `create_llm_as_judge` will alter the return value of the evaluator. By default, it returns a simple dictionary with a boolean `score` and a `comment`. A custom schema will override this structure.
- gotcha Many of the core evaluators, especially LLM-as-judge evaluators, require an API key for an external LLM provider (e.g., OpenAI, Anthropic). This key must be configured in your environment, typically via an environment variable like `OPENAI_API_KEY`.
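The API-key requirement above can be enforced with a fail-fast check at startup rather than discovered mid-run. The helper below is a hypothetical sketch using only the standard library; `require_env_key` is not part of OpenEvals.

```python
import os

def require_env_key(name: str) -> str:
    """Return the value of an environment variable, or raise with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. LLM-as-judge evaluators need it to call the provider."
        )
    return value

# Example: fail fast before building any evaluators.
os.environ.setdefault("OPENAI_API_KEY", "sk-placeholder")  # demo only; use a real key
key = require_env_key("OPENAI_API_KEY")
print(key.startswith("sk-"))
```

Calling this once at import time surfaces a missing key immediately, instead of as a provider error deep inside an evaluation run.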
Install
-
pip install openevals
Imports
- create_llm_as_judge
from openevals.llm import create_llm_as_judge
- CORRECTNESS_PROMPT
from openevals.prompts import CORRECTNESS_PROMPT
- CONCISENESS_PROMPT
from openevals.prompts import CONCISENESS_PROMPT
- HALLUCINATION_PROMPT
from openevals.prompts import HALLUCINATION_PROMPT
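The prebuilt prompts are plain string templates with placeholders such as `{inputs}`, `{outputs}`, and (where relevant) `{reference_outputs}` that the evaluator fills in at call time. The sketch below imitates that substitution with a simplified, made-up template; the real `CORRECTNESS_PROMPT` text is longer and differs.

```python
# Simplified stand-in for a judge prompt; not the actual CORRECTNESS_PROMPT text.
TOY_CORRECTNESS_PROMPT = (
    "You are judging correctness.\n"
    "Question: {inputs}\n"
    "Answer: {outputs}\n"
    "Reference: {reference_outputs}\n"
    "Is the answer correct?"
)

# The evaluator performs substitution like this with your call-time arguments.
filled = TOY_CORRECTNESS_PROMPT.format(
    inputs="How much has the price of doodads changed?",
    outputs="Up 10%.",
    reference_outputs="Down 50%.",
)
print(filled)
```

This is why the evaluator is called with keyword arguments named `inputs`, `outputs`, and `reference_outputs`: the names must match the template's placeholders.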
Quickstart
import os
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT
# Your OpenAI API key must be set as an environment variable, e.g.:
# os.environ["OPENAI_API_KEY"] = "sk-..."
# We use .get here so the script prints a warning instead of raising immediately,
# but the key is required for the evaluator call to succeed.
if not os.environ.get("OPENAI_API_KEY"):
    print("WARNING: OPENAI_API_KEY not set. The evaluator call below will fail without it.")
# Create a correctness evaluator using an LLM-as-judge
correctness_evaluator = create_llm_as_judge(
prompt=CORRECTNESS_PROMPT,
model="openai:o3-mini", # LangChain-style "provider:model" identifier
)
# Define inputs, outputs, and reference outputs for evaluation
inputs = "How much has the price of doodads changed in the past year?"
outputs = "Doodads have increased in price by 10% in the past year."
reference_outputs = "The price of doodads has decreased by 50% in the past year."
# Run the evaluator
eval_result = correctness_evaluator(
inputs=inputs,
outputs=outputs,
reference_outputs=reference_outputs
)
print(eval_result)
# Expected output (score might vary slightly based on LLM, but structure is consistent):
# { 'key': 'score', 'score': False, 'comment': 'The provided answer stated that doodads increased in price by 10%, which conflicts with the reference output...' }
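Because the result is a plain dictionary, downstream code can consume it directly. The sketch below processes a hard-coded result shaped like the output above (no API call is made); `summarize_result` is a hypothetical helper, not part of OpenEvals.

```python
def summarize_result(result: dict) -> str:
    """Turn an evaluator result dict into a one-line pass/fail summary."""
    verdict = "PASS" if result.get("score") else "FAIL"
    comment = result.get("comment", "")
    return f"[{verdict}] {result.get('key', 'score')}: {comment}"

# Hard-coded result mirroring the structure shown in the quickstart output.
sample = {
    "key": "score",
    "score": False,
    "comment": "The answer conflicts with the reference output.",
}
print(summarize_result(sample))
# → [FAIL] score: The answer conflicts with the reference output.
```

A pattern like this makes it easy to aggregate results across a test suite, e.g. counting failures or logging comments for failed cases only.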