Pydantic Evals
Pydantic Evals is a framework for defining and executing evaluations of stochastic code, particularly LLM-based applications. It lets you create datasets of test cases, define custom evaluators, and run evaluations to assess model performance and behavior. It is part of the broader Pydantic AI ecosystem and is versioned in lockstep with `pydantic-ai`, with a rapid release cadence reflecting active development.
Warnings
- gotcha While `pydantic-evals` provides the evaluation framework, interacting with actual LLM APIs (e.g., OpenAI, Anthropic) typically requires installing the main `pydantic-ai` package with its provider-specific extras (e.g., `pip install "pydantic-ai[openai]"` — quoted, since brackets are special characters in some shells) and setting the relevant API keys (e.g., `OPENAI_API_KEY`).
- breaking The `pydantic-ai` ecosystem, including `pydantic-evals`, is under rapid and active development. API interfaces, especially for custom `Evaluator` implementations and the reporting API, may evolve quickly even within minor version increments, potentially requiring code adjustments when upgrading.
- gotcha Performance of evaluations can vary significantly with the concurrency limit (`max_concurrency` on the evaluate methods) and the complexity of `Evaluator` implementations. Large datasets or slow LLM interactions can lead to long evaluation times without proper tuning.
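Why the concurrency cap matters: evaluations are mostly I/O-bound, so the number of in-flight LLM calls dominates wall-clock time. A stdlib-only sketch of bounding concurrency with a semaphore (the parameter names here are illustrative; check your installed version for the exact knobs):

```python
import asyncio

async def run_with_limit(n_tasks: int, max_concurrency: int) -> int:
    """Run n_tasks fake LLM calls under a concurrency cap; return the peak
    number of calls that were actually in flight at once."""
    sem = asyncio.Semaphore(max_concurrency)
    in_flight = 0
    peak = 0

    async def fake_llm_call() -> None:
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for network latency
            in_flight -= 1

    await asyncio.gather(*(fake_llm_call() for _ in range(n_tasks)))
    return peak

peak = asyncio.run(run_with_limit(n_tasks=20, max_concurrency=4))
print(peak)  # bounded by the semaphore, regardless of n_tasks
```

A cap that is too low serializes the run; one that is too high can trip provider rate limits, so tune it per provider.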
Install
- pip install pydantic-evals
- pip install "pydantic-ai[openai]"
Imports
- Case, Dataset
from pydantic_evals import Case, Dataset
- Evaluator, EvaluatorContext
from pydantic_evals.evaluators import Evaluator, EvaluatorContext
- Built-in evaluators (e.g. IsInstance, LLMJudge)
from pydantic_evals.evaluators import IsInstance, LLMJudge
Quickstart
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

# 1. Define the task under test (mocked for a runnable example without API keys)
async def answer_question(question: str) -> str:
    if "capital of France" in question:
        return "The capital of France is Paris."
    if "Python programming" in question:
        return "Python is named after the British sketch comedy group Monty Python."
    return f"Mock response to: {question[:50]}..."

# 2. Define your evaluation logic
@dataclass
class SimpleKeywordEvaluator(Evaluator[str, str]):
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        # For simplicity, check for specific keywords based on the input
        if "capital of France" in ctx.inputs and "Paris" in ctx.output:
            return 1.0
        if "Python programming" in ctx.inputs and "Python" in ctx.output:
            return 1.0
        return 0.0

# 3. Create a Dataset of evaluation cases and attach the evaluator
dataset = Dataset(
    cases=[
        Case(name="capital", inputs="What is the capital of France?"),
        Case(name="fun_fact", inputs="Tell me a fun fact about Python programming."),
        Case(name="math", inputs="What is 2 + 2?"),  # designed to score 0.0
    ],
    evaluators=[SimpleKeywordEvaluator()],
)

# 4. Run the evaluation and print a summary table
if __name__ == "__main__":
    report = dataset.evaluate_sync(answer_question)
    report.print(include_input=True, include_output=True)
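Beyond per-case output, you often want headline numbers. A plain-Python sketch of collapsing per-case scores into aggregates (independent of the library's report object, whose fields may differ by version; `summarize` is a hypothetical helper):

```python
from statistics import mean

def summarize(scores: list[float], pass_threshold: float = 1.0) -> dict[str, float]:
    """Collapse per-case scores into headline metrics."""
    return {
        "mean_score": mean(scores),
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        "n_cases": float(len(scores)),
    }

# e.g. three cases: two keyword hits and one designed miss
print(summarize([1.0, 1.0, 0.0]))
```

Tracking these aggregates across model or prompt versions is what makes an eval suite useful as a regression gate.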