{"id":7903,"library":"agentevals","title":"Open-source Evaluators for LLM Agents","description":"Agentevals is an open-source Python library from Microsoft designed to help developers effectively evaluate the performance of Large Language Model (LLM) agents. It provides a framework for defining custom agents, various types of evaluators (e.g., code execution, human feedback), and structured scenarios for consistent testing. The library is currently in early development (v0.0.9) and is expected to have regular updates with evolving features and APIs.","status":"active","version":"0.0.9","language":"en","source_language":"en","source_url":"https://github.com/microsoft/agentevals","tags":["LLM","agent","evaluation","AI","Microsoft","testing"],"install":[{"cmd":"pip install agentevals","lang":"bash","label":"Install latest version"}],"dependencies":[],"imports":[{"symbol":"CustomAgent","correct":"from agentevals.agents import CustomAgent"},{"symbol":"CodeExecutionEvaluator","correct":"from agentevals.evaluators import CodeExecutionEvaluator"},{"symbol":"HumanFeedbackScenario","correct":"from agentevals.scenarios import HumanFeedbackScenario"}],"quickstart":{"code":"import json\nfrom agentevals.agents import CustomAgent\nfrom agentevals.evaluators import CodeExecutionEvaluator\nfrom agentevals.scenarios import HumanFeedbackScenario\n\n# 1. Define your custom agent by inheriting from CustomAgent\n#    and implementing the `run` method.\nclass MySimpleAgent(CustomAgent):\n    def run(self, input_data: dict) -> dict:\n        task = input_data.get(\"task\", \"no task specified\")\n        # Simulate an agent processing a task and returning an output\n        if \"math problem\" in task:\n            return {\"output\": \"I processed a math problem!\"}\n        return {\"output\": f\"Agent processed task: '{task}'\"}\n\n# 2. Instantiate your agent\nagent_instance = MySimpleAgent(name=\"my-eval-agent\")\n\n# 3. Instantiate an evaluator, associating it with your agent.\n#    CodeExecutionEvaluator is one type; others exist in `agentevals.evaluators`.\nevaluator = CodeExecutionEvaluator(agent=agent_instance, max_iterations=1)\n\n# 4. Define a scenario that provides input for your agent.\n#    HumanFeedbackScenario is one type; others exist in `agentevals.scenarios`.\nscenario_data = {\n    \"task\": \"Solve a simple math problem\"\n}\nevaluation_scenario = HumanFeedbackScenario(\n    scenario_id=\"math_scenario_1\",\n    input_data=scenario_data,\n    # expected_output is optional and its usage depends on the specific evaluator.\n    expected_output={\"result\": \"Solution to math problem\"}\n)\n\n# 5. Run the evaluation\nresults = evaluator.evaluate(scenario=evaluation_scenario)\n\n# Print the structured results\nprint(json.dumps(results, indent=2))","lang":"python","description":"This quickstart demonstrates how to define a custom LLM agent, instantiate an evaluator, create a scenario with input data, and run an evaluation. The output shows a JSON representation of the evaluation results."},"warnings":[{"fix":"Regularly check the official GitHub repository for updates and breaking changes before upgrading. Pin exact versions in `requirements.txt` to prevent unexpected breakages.","message":"Agentevals is explicitly noted as being in 'early development'. This means API interfaces, class names, and method signatures are subject to frequent changes without strict adherence to semantic versioning for minor releases (e.g., `0.x.x` to `0.y.x`).","severity":"gotcha","affected_versions":"All 0.x.x versions"},{"fix":"Always consult the latest README or release notes on GitHub when upgrading to a new `0.x.x` version. Test thoroughly after any version bump and be prepared to adapt your code to new API patterns.","message":"Due to its early development stage, `0.x.x` releases (e.g., upgrading from `0.0.8` to `0.0.9`) can introduce breaking changes. These often include method renames, argument signature changes, or class restructurings that are not always explicitly called out in patch notes.","severity":"breaking","affected_versions":"All 0.x.x versions"},{"fix":"Refer to the specific class's `__init__` and `evaluate` method signatures in the documentation or source code. Ensure all mandatory parameters are provided with the correct types and values.","message":"Many evaluators and scenarios require specific arguments to be passed during instantiation or evaluation. Forgetting or providing incorrect arguments (e.g., `agent` for evaluators, `input_data` for scenarios) will lead to runtime errors, often `TypeError` or `ValueError`.","severity":"gotcha","affected_versions":"All 0.x.x versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Run `pip install agentevals` in your terminal to install the library.","cause":"The 'agentevals' package is not installed in the current Python environment or is not accessible.","error":"ModuleNotFoundError: No module named 'agentevals'"},{"fix":"Ensure all required arguments are passed to the constructor. For example: `evaluator = CodeExecutionEvaluator(agent=my_agent_instance, max_iterations=X)`.","cause":"An evaluator class (e.g., `CodeExecutionEvaluator`) was instantiated without providing a mandatory argument, such as an `agent` instance.","error":"TypeError: __init__() missing 1 required positional argument: 'agent'"},{"fix":"Ensure your custom agent class inherits from `agentevals.agents.CustomAgent` and implements a `run` method with the signature `run(self, input_data: dict) -> dict`.","cause":"Your custom agent class does not inherit from `agentevals.agents.CustomAgent`, does not implement the required `run` method, or implements it with an incorrect signature.","error":"AttributeError: 'MyCustomAgent' object has no attribute 'run'"}]}