{"id":1643,"library":"pydantic-evals","title":"Pydantic Evals","description":"Pydantic Evals is a framework for defining and running evaluations of stochastic functions, particularly LLM-based applications. Users create datasets of test cases, define custom evaluators, and run evaluations to assess model performance and behavior. It is part of the broader Pydantic AI ecosystem, currently at version 1.78.0, with a rapid release cadence reflecting active development.","status":"active","version":"1.78.0","language":"en","source_language":"en","source_url":"https://github.com/pydantic/pydantic-ai","tags":["LLM","evaluation","AI","pydantic","testing"],"install":[{"cmd":"pip install pydantic-evals","lang":"bash","label":"Install core library"},{"cmd":"pip install pydantic-ai[openai]","lang":"bash","label":"For OpenAI LLM integration (optional but common)"}],"dependencies":[{"reason":"Core data validation and settings management, foundational to the library's design.","package":"pydantic>=2.0","optional":false},{"reason":"Provides the actual LLM integrations (e.g., OpenAI, Anthropic) commonly used with pydantic-evals. pydantic-evals can evaluate any task function, but real model calls are typically made through pydantic-ai agents or models.","package":"pydantic-ai","optional":true}],"imports":[{"symbol":"Dataset","correct":"from pydantic_evals import Dataset"},{"symbol":"Case","correct":"from pydantic_evals import Case"},{"symbol":"Evaluator","correct":"from pydantic_evals.evaluators import Evaluator"},{"symbol":"EvaluatorContext","correct":"from pydantic_evals.evaluators import EvaluatorContext"}],"quickstart":{"code":"from dataclasses import dataclass\n\nfrom pydantic_evals import Case, Dataset\nfrom pydantic_evals.evaluators import Evaluator, EvaluatorContext\n\n# 1. Define the task under test (mocked here so the example runs without API keys).\n#    In practice, replace the body with a real model call (e.g., a pydantic-ai agent).\nasync def answer_question(question: str) -> str:\n    if \"capital of France\" in question:\n        return \"The capital of France is Paris.\"\n    if \"Python programming\" in question:\n        return \"Python is named after the British sketch comedy group Monty Python.\"\n    return f\"Mock response to: {question[:50]}...\"\n\n# 2. Define your evaluation logic.\n@dataclass\nclass SimpleKeywordEvaluator(Evaluator):\n    def evaluate(self, ctx: EvaluatorContext) -> float:\n        # For simplicity, check for specific keywords based on the input.\n        if \"capital of France\" in ctx.inputs and \"Paris\" in ctx.output:\n            return 1.0\n        if \"Python programming\" in ctx.inputs and \"Python\" in ctx.output:\n            return 1.0\n        return 0.0\n\n# 3. Create a Dataset of evaluation cases and attach the evaluator.\ndataset = Dataset(\n    cases=[\n        Case(name=\"capital\", inputs=\"What is the capital of France?\"),\n        Case(name=\"python_fact\", inputs=\"Tell me a fun fact about Python programming.\"),\n        Case(name=\"arithmetic\", inputs=\"What is 2 + 2?\"),  # designed to score 0.0\n    ],\n    evaluators=[SimpleKeywordEvaluator()],\n)\n\n# 4. Run the evaluation and print a summary table.\nif __name__ == \"__main__\":\n    report = dataset.evaluate_sync(answer_question)\n    report.print(include_input=True, include_output=True)","lang":"python","description":"This example defines a mocked async task function, a custom keyword-based evaluator, and a dataset of test cases, then runs the evaluation and prints a report table. For real LLM interactions, replace the mocked task body with an actual model call (e.g., a pydantic-ai agent) and ensure API keys are set."},"warnings":[{"fix":"Install `pydantic-ai` with relevant extras and configure API keys as per the `pydantic-ai` documentation.","message":"While `pydantic-evals` provides the evaluation framework, interacting with actual LLM APIs (e.g., OpenAI, Anthropic) typically requires installing the main `pydantic-ai` package and its provider-specific extras (e.g., `pip install pydantic-ai[openai]`), along with setting up API keys (e.g., `OPENAI_API_KEY`).","severity":"gotcha","affected_versions":">=1.0.0"},{"fix":"Refer to the official documentation and release notes before upgrading, and test your evaluation logic thoroughly after updates. Pinning exact versions may be necessary for stability in production.","message":"The `pydantic-ai` ecosystem, including `pydantic-evals`, is under rapid and active development. API interfaces, especially custom `Evaluator` implementations and the reporting API, may evolve quickly even within minor version increments, potentially requiring code adjustments when upgrading.","severity":"breaking","affected_versions":">=1.0.0"},{"fix":"Tune `max_concurrency` in `dataset.evaluate()`/`dataset.evaluate_sync()` for better throughput, and consider caching LLM responses or evaluator results for repetitive tests.","message":"Evaluation performance can vary significantly with concurrency settings and the complexity of `Evaluator` implementations. Large datasets or slow LLM interactions can lead to long evaluation times without proper tuning.","severity":"gotcha","affected_versions":">=1.0.0"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}