Open-source Evaluators for LLM Agents
Agentevals is an open-source Python library designed to help developers evaluate the performance of Large Language Model (LLM) agents. It provides a framework for defining custom agents, several evaluator types (e.g., code execution, human feedback), and structured scenarios for consistent, repeatable testing. The library is in early development (v0.0.9), so expect frequent releases with evolving features and APIs.
Common errors
- `ModuleNotFoundError: No module named 'agentevals'`
  - cause: The `agentevals` package is not installed in the current Python environment, or the wrong environment is active.
  - fix: Run `pip install agentevals` in your terminal to install the library.
- `TypeError: __init__() missing 1 required positional argument: 'agent'`
  - cause: An evaluator class (e.g., `CodeExecutionEvaluator`) was instantiated without a mandatory argument, such as an `agent` instance.
  - fix: Pass all required arguments to the constructor. For example: `evaluator = CodeExecutionEvaluator(agent=my_agent_instance, max_iterations=X)`.
- `AttributeError: 'MyCustomAgent' object has no attribute 'run'`
  - cause: Your custom agent class does not inherit from `agentevals.agents.CustomAgent`, does not implement the required `run` method, or implements it with an incorrect signature.
  - fix: Inherit from `agentevals.agents.CustomAgent` and implement a `run` method with the signature `run(self, input_data: dict) -> dict`.
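To make the required `run` contract concrete, here is a minimal sketch. The `CustomAgent` class below is a hypothetical stand-in defined locally so the snippet runs without `agentevals` installed; in real code, import `agentevals.agents.CustomAgent` instead.

```python
# Hypothetical stand-in for agentevals.agents.CustomAgent so this snippet
# runs standalone; in practice, import the real class from the library.
class CustomAgent:
    def __init__(self, name: str):
        self.name = name

class EchoAgent(CustomAgent):
    # The required signature: run(self, input_data: dict) -> dict
    def run(self, input_data: dict) -> dict:
        return {"output": f"echo: {input_data.get('task', '')}"}

agent = EchoAgent(name="echo-agent")
print(agent.run({"task": "ping"}))  # {'output': 'echo: ping'}
```

Subclassing with exactly this method name and signature is what prevents the `AttributeError` above.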
Warnings
- gotcha: Agentevals is explicitly in early development. API interfaces, class names, and method signatures can change frequently, without strict adherence to semantic versioning (e.g., between `0.x.x` and `0.y.x`).
- breaking: Because of the early development stage, even `0.x.x` releases (e.g., upgrading from `0.0.8` to `0.0.9`) can introduce breaking changes, including method renames, argument signature changes, or class restructurings that are not always called out in release notes.
- gotcha: Many evaluators and scenarios require specific arguments at instantiation or evaluation time. Missing or incorrect arguments (e.g., `agent` for evaluators, `input_data` for scenarios) cause runtime errors, typically `TypeError` or `ValueError`.
Install
- `pip install agentevals`
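Given the breaking-change caveats in the Warnings above, it is prudent to pin the exact version you tested against. A sketch, using the current v0.0.9 as the pin:

```shell
# Pin the exact tested version on the command line,
# or equivalently add the line `agentevals==0.0.9` to requirements.txt.
pip install "agentevals==0.0.9"
```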
Imports
- `CustomAgent`: `from agentevals.agents import CustomAgent`
- `CodeExecutionEvaluator`: `from agentevals.evaluators import CodeExecutionEvaluator`
- `HumanFeedbackScenario`: `from agentevals.scenarios import HumanFeedbackScenario`
Quickstart
```python
import json

from agentevals.agents import CustomAgent
from agentevals.evaluators import CodeExecutionEvaluator
from agentevals.scenarios import HumanFeedbackScenario

# 1. Define your custom agent by inheriting from CustomAgent
#    and implementing the `run` method.
class MySimpleAgent(CustomAgent):
    def run(self, input_data: dict) -> dict:
        task = input_data.get("task", "no task specified")
        # Simulate an agent processing a task and returning an output
        if "math problem" in task:
            return {"output": "I processed a math problem!"}
        return {"output": f"Agent processed task: '{task}'"}

# 2. Instantiate your agent
agent_instance = MySimpleAgent(name="my-eval-agent")

# 3. Instantiate an evaluator, associating it with your agent.
#    CodeExecutionEvaluator is one type; others exist in `agentevals.evaluators`.
evaluator = CodeExecutionEvaluator(agent=agent_instance, max_iterations=1)

# 4. Define a scenario that provides input for your agent.
#    HumanFeedbackScenario is one type; others exist in `agentevals.scenarios`.
scenario_data = {"task": "Solve a simple math problem"}
evaluation_scenario = HumanFeedbackScenario(
    scenario_id="math_scenario_1",
    input_data=scenario_data,
    # expected_output is optional; its usage depends on the specific evaluator.
    expected_output={"result": "Solution to math problem"},
)

# 5. Run the evaluation and print the structured results
results = evaluator.evaluate(scenario=evaluation_scenario)
print(json.dumps(results, indent=2))
```
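Because the API can shift between `0.x` releases, a small runtime guard can flag when the installed version differs from the one your code was written against. This sketch uses only the standard library; the function name is ours, not part of agentevals:

```python
from importlib import metadata

def check_pinned_version(package: str, expected: str) -> str:
    """Return the installed version of `package`, warning on a mismatch."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        raise SystemExit(f"{package} is not installed; run `pip install {package}`")
    if installed != expected:
        print(f"warning: code was written against {package} {expected}, "
              f"but {installed} is installed")
    return installed

# Call this before importing agentevals elsewhere in your code, e.g.:
# check_pinned_version("agentevals", "0.0.9")
```

Keeping the expected version next to the code that uses the library makes silent `0.x` breakage easier to diagnose.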