Autoevals
Autoevals is a universal library for quickly and easily evaluating AI model outputs. Developed by the team at Braintrust, it bundles a variety of automatic evaluation methods, including LLM-as-a-judge, heuristic (e.g., Levenshtein distance), and statistical (e.g., BLEU) evaluations. The library is actively maintained with frequent updates.
Warnings
- gotcha LLM-as-a-judge evaluation methods (e.g., `Factuality`, `ClosedQA`) require an OpenAI API key (or a compatible API endpoint). Ensure the `OPENAI_API_KEY` environment variable is set. If `OPENAI_BASE_URL` is not specified, requests default to the Braintrust AI proxy.
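For example, before running any LLM-based scorer (the key and endpoint values below are placeholders):

```shell
# Required for LLM-as-a-judge scorers such as Factuality
export OPENAI_API_KEY="sk-..."

# Optional: route requests to any OpenAI-compatible endpoint
# instead of the default Braintrust AI proxy
export OPENAI_BASE_URL="https://api.openai.com/v1"
```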
- gotcha Be aware of potential name collisions: this `autoevals` library (from Braintrust) is distinct from other Python packages like `auto-eval` (a CLI tool), `autoevaluator` (another LLM evaluation framework), or `oak-ai-autoeval-tools`, as well as unrelated robotics projects named 'AutoEval'. Always verify the package source and import paths.
- gotcha The `NumericDiff` evaluator requires an `expected` value to be passed during evaluation; otherwise, it will raise a `ValueError`.
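`NumericDiff` scores by relative difference. A minimal sketch of that normalization, written here for intuition as a standalone function (an approximation of the library's behavior, not its source):

```python
def numeric_diff_score(output: float, expected: float) -> float:
    # Relative difference normalized to [0, 1]; identical values score 1.0.
    # (Approximation of Autoevals' NumericDiff scoring, for illustration.)
    if output == expected:
        return 1.0
    return 1 - abs(expected - output) / max(abs(expected), abs(output))

print(numeric_diff_score(105, 100))  # 1 - 5/105 ≈ 0.952
```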
- gotcha While many evaluation concepts are adapted from OpenAI's `evals` project, Autoevals implements them to offer greater flexibility for individual examples, prompt tweaking, and output debugging. Users accustomed to OpenAI's `evals` might find the API and usage patterns different.
Install
pip install autoevals
Imports
- Factuality
from autoevals.llm import Factuality
- ClosedQA
from autoevals.llm import ClosedQA
- Levenshtein
from autoevals import Levenshtein
- NumericDiff
from autoevals.number import NumericDiff
Quickstart
import os
import asyncio

from autoevals import Levenshtein
from autoevals.llm import Factuality
from autoevals.number import NumericDiff

# LLM evaluators read OPENAI_API_KEY from the environment; set it before running.


async def main():
    # Example 1: LLM-as-a-judge evaluation (requires an LLM API key)
    print("--- LLM Factuality Evaluation ---")
    factuality_evaluator = Factuality()
    input_text = "Which country has the highest population?"
    output_text = "People's Republic of China"
    expected_text = "China"

    if os.environ.get("OPENAI_API_KEY"):
        # Synchronous evaluation
        llm_result_sync = factuality_evaluator(output_text, expected_text, input=input_text)
        print(f"Factuality score (sync): {llm_result_sync.score}")
        print(f"Factuality rationale (sync): {llm_result_sync.metadata.get('rationale')}")

        # Asynchronous evaluation
        llm_result_async = await factuality_evaluator.eval_async(output_text, expected_text, input=input_text)
        print(f"Factuality score (async): {llm_result_async.score}")
        print(f"Factuality rationale (async): {llm_result_async.metadata.get('rationale')}")
    else:
        print("Skipping LLM Factuality evaluation: OPENAI_API_KEY not set.")

    # Example 2: Heuristic evaluation (Levenshtein distance)
    print("\n--- Levenshtein Distance Evaluation ---")
    levenshtein_evaluator = Levenshtein()
    lev_result = levenshtein_evaluator("hello world", "hallo world")
    print(f"Levenshtein score: {lev_result.score}")
    print(f"Levenshtein metadata: {lev_result.metadata}")

    # Example 3: Numeric difference evaluation
    print("\n--- Numeric Difference Evaluation ---")
    numeric_evaluator = NumericDiff()
    num_result = numeric_evaluator(105, 100)
    print(f"NumericDiff score: {num_result.score}")
    print(f"NumericDiff metadata: {num_result.metadata}")


if __name__ == "__main__":
    asyncio.run(main())
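For intuition on the Levenshtein example above: the scorer maps edit distance into [0, 1] by normalizing against the longer string. A self-contained sketch of that idea (`edit_distance` is a plain dynamic-programming implementation written here for illustration, not the library's code):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_score(output: str, expected: str) -> float:
    # Normalize by the longer string's length (an approximation of how
    # Autoevals' Levenshtein scorer maps distance to [0, 1]).
    max_len = max(len(output), len(expected))
    return 1.0 if max_len == 0 else 1 - edit_distance(output, expected) / max_len


print(levenshtein_score("hello world", "hallo world"))  # 1 - 1/11 ≈ 0.909
```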