Open-source Evaluators for LLM Agents

0.0.9 · active · verified Thu Apr 16

Agentevals is an open-source Python library from Microsoft for evaluating the performance of Large Language Model (LLM) agents. It provides a framework for defining custom agents, a set of evaluator types (e.g., code execution, human feedback), and structured scenarios so that tests run consistently and repeatably. The library is in early development (v0.0.9); expect frequent releases and treat its features and APIs as subject to change.

Common errors

Warnings

Install
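
A sketch of a typical installation, assuming the package is published on PyPI under the name `agentevals`; pinning the version is advisable while the project is pre-1.0:

```shell
# Install from PyPI; pin the 0.0.x release since pre-1.0 APIs may change
pip install "agentevals==0.0.9"
```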

Imports

Quickstart

This quickstart demonstrates how to define a custom LLM agent, instantiate an evaluator, create a scenario with input data, and run an evaluation. The output shows a JSON representation of the evaluation results.

import json
from agentevals.agents import CustomAgent
from agentevals.evaluators import CodeExecutionEvaluator
from agentevals.scenarios import HumanFeedbackScenario

# 1. Define your custom agent by inheriting from CustomAgent
#    and implementing the `run` method.
class MySimpleAgent(CustomAgent):
    def run(self, input_data: dict) -> dict:
        task = input_data.get("task", "no task specified")
        # Simulate an agent processing a task and returning an output
        if "math problem" in task:
            return {"output": "I processed a math problem!"}
        return {"output": f"Agent processed task: '{task}'"}

# 2. Instantiate your agent
agent_instance = MySimpleAgent(name="my-eval-agent")

# 3. Instantiate an evaluator, associating it with your agent.
#    CodeExecutionEvaluator is one type; others exist in `agentevals.evaluators`.
evaluator = CodeExecutionEvaluator(agent=agent_instance, max_iterations=1)

# 4. Define a scenario that provides input for your agent.
#    HumanFeedbackScenario is one type; others exist in `agentevals.scenarios`.
scenario_data = {
    "task": "Solve a simple math problem"
}
evaluation_scenario = HumanFeedbackScenario(
    scenario_id="math_scenario_1",
    input_data=scenario_data,
    # expected_output is optional and its usage depends on the specific evaluator.
    expected_output={"result": "Solution to math problem"}
)

# 5. Run the evaluation
results = evaluator.evaluate(scenario=evaluation_scenario)

# Print the structured results
print(json.dumps(results, indent=2))
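
The schema of the `results` object isn't documented here, so a consumer shouldn't assume specific fields. Below is a minimal sketch of inspecting a hypothetical result payload after the JSON round-trip the quickstart performs; the `scenario_id`, `passed`, and `score` keys are illustrative assumptions, not a documented contract:

```python
import json

# Hypothetical evaluation payload -- field names are illustrative only.
results = {
    "scenario_id": "math_scenario_1",
    "passed": True,
    "score": 0.92,
    "details": {"iterations": 1},
}

# Round-trip through JSON, as the quickstart's final print() does.
serialized = json.dumps(results, indent=2)
decoded = json.loads(serialized)

# Defensive access: use .get() rather than assuming optional keys exist.
score = decoded.get("score")
print(f"scenario={decoded['scenario_id']} "
      f"passed={decoded.get('passed')} score={score}")
```

Because evaluator output may gain or lose fields between 0.0.x releases, defensive `.get()` access keeps downstream tooling from breaking on schema changes.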
