{"id":2869,"library":"autoevals","title":"Autoevals","description":"Autoevals is a library for quickly and easily evaluating AI model outputs. Developed by the team at Braintrust, it bundles together a variety of automatic evaluation methods, including LLM-as-a-judge, heuristic (e.g., Levenshtein distance), and statistical (e.g., BLEU) evaluations. Currently at version 0.2.0, the library is actively maintained with frequent updates.","status":"active","version":"0.2.0","language":"en","source_language":"en","source_url":"https://github.com/braintrustdata/autoevals","tags":["AI","LLM","evaluation","metrics","machine learning","NLP","testing"],"install":[{"cmd":"pip install autoevals","lang":"bash","label":"Install autoevals"}],"dependencies":[{"reason":"Required for using LLM-as-a-judge evaluation methods (e.g., Factuality, ClosedQA), which rely on the OpenAI API or compatible services. Compatible with both OpenAI Python SDK v0.x and v1.x.","package":"openai","optional":true}],"imports":[{"note":"Commonly used for LLM-as-a-judge evaluations related to factual accuracy.","symbol":"Factuality","correct":"from autoevals.llm import Factuality"},{"note":"An LLM-as-a-judge evaluator that grades an answer to a closed question against supplied criteria.","symbol":"ClosedQA","correct":"from autoevals.llm import ClosedQA"},{"note":"For heuristic evaluations based on Levenshtein distance.","symbol":"Levenshtein","correct":"from autoevals import Levenshtein"},{"note":"For evaluating numerical differences with a normalized score.","symbol":"NumericDiff","correct":"from autoevals.number import NumericDiff"}],"quickstart":{"code":"import os\nimport asyncio\nfrom autoevals.llm import Factuality\nfrom autoevals import Levenshtein\nfrom autoevals.number import NumericDiff\n\n# Set up your OpenAI API key (or compatible service) for LLM evaluators\n# In a production environment, ensure this is loaded securely from an environment variable.\nos.environ['OPENAI_API_KEY'] = 
os.environ.get('OPENAI_API_KEY', 'YOUR_OPENAI_API_KEY_HERE')\n\nasync def main():\n    # Example 1: LLM-as-a-judge evaluation (requires an LLM API key)\n    print(\"--- LLM Factuality Evaluation ---\")\n    factuality_evaluator = Factuality()\n    input_text = \"Which country has the highest population?\"\n    output_text = \"People's Republic of China\"\n    expected_text = \"China\"\n\n    # Synchronous evaluation\n    if os.environ['OPENAI_API_KEY'] != 'YOUR_OPENAI_API_KEY_HERE':\n        llm_result_sync = factuality_evaluator(output_text, expected_text, input=input_text)\n        print(f\"Factuality score (sync): {llm_result_sync.score}\")\n        print(f\"Factuality rationale (sync): {llm_result_sync.metadata.get('rationale')}\")\n\n        # Asynchronous evaluation\n        llm_result_async = await factuality_evaluator.eval_async(output_text, expected_text, input=input_text)\n        print(f\"Factuality score (async): {llm_result_async.score}\")\n        print(f\"Factuality rationale (async): {llm_result_async.metadata.get('rationale')}\")\n    else:\n        print(\"Skipping LLM Factuality evaluation: OPENAI_API_KEY not set.\")\n\n    # Example 2: Heuristic evaluation (Levenshtein distance)\n    print(\"\\n--- Levenshtein Distance Evaluation ---\")\n    levenshtein_evaluator = Levenshtein()\n    output_str = \"hello world\"\n    expected_str = \"hallo world\"\n    lev_result = levenshtein_evaluator(output_str, expected_str)\n    print(f\"Levenshtein score: {lev_result.score}\")\n    print(f\"Levenshtein metadata: {lev_result.metadata}\")\n\n    # Example 3: Numeric difference evaluation\n    print(\"\\n--- Numeric Difference Evaluation ---\")\n    numeric_evaluator = NumericDiff()\n    output_num = 105\n    expected_num = 100\n    num_result = numeric_evaluator(output_num, expected_num)\n    print(f\"NumericDiff score: {num_result.score}\")\n    print(f\"NumericDiff metadata: {num_result.metadata}\")\n\nif __name__ == \"__main__\":\n    
asyncio.run(main())\n","lang":"python","description":"This quickstart demonstrates how to use various autoevals scorers, including an LLM-as-a-judge evaluator (Factuality), a heuristic evaluator (Levenshtein), and a numeric evaluator (NumericDiff). It covers both synchronous and asynchronous evaluation patterns for LLM-based scorers and includes a placeholder for the OpenAI API key, which is essential for LLM evaluations."},"warnings":[{"fix":"Set `OPENAI_API_KEY` in your environment variables. For custom endpoints, also set `OPENAI_BASE_URL`.","message":"LLM-as-a-judge evaluation methods (e.g., `Factuality`, `ClosedQA`) require an OpenAI API key (or a compatible API endpoint). Ensure the `OPENAI_API_KEY` environment variable is set. If `OPENAI_BASE_URL` is not specified, it defaults to an internal AI proxy.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure you `pip install autoevals` (from `braintrustdata`) and use import paths like `from autoevals.llm import Factuality`.","message":"Be aware of potential name collisions: this `autoevals` library (from Braintrust) is distinct from other Python packages like `auto-eval` (a CLI tool), `autoevaluator` (another LLM evaluation framework), or `oak-ai-autoeval-tools`, as well as unrelated robotics projects named 'AutoEval'. 
Always verify the package source and import paths.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always provide the `expected` argument when using `NumericDiff`, e.g., `numeric_evaluator(output=105, expected=100)`.","message":"The `NumericDiff` evaluator requires an `expected` value to be passed during evaluation; otherwise, it will raise a `ValueError`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Refer to the official Autoevals documentation for the correct API usage and examples.","message":"While many evaluation concepts are adapted from OpenAI's `evals` project, Autoevals implements them to offer greater flexibility for individual examples, prompt tweaking, and output debugging. Users accustomed to OpenAI's `evals` might find the API and usage patterns different.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}