Pydantic Evals

1.78.0 · active · verified Thu Apr 09

Pydantic Evals is a framework for defining and running evaluations of stochastic functions, and is particularly useful for LLM-based applications. It lets users create datasets of test cases, define custom evaluators, and run evaluations to assess model performance and behavior. It is part of the broader Pydantic AI ecosystem, currently at version 1.78.0, with a rapid release cadence reflecting active development.

Warnings

Install
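
Pydantic Evals is published on PyPI under the package name `pydantic-evals` (it can also be pulled in via Pydantic AI's optional extras):

```shell
pip install pydantic-evals
```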

Imports

Quickstart

This example defines a mock task function standing in for a real LLM call, a simple keyword-based evaluator, a dataset of test cases, and then runs an evaluation. For real LLM interactions, replace the mock task with a call to an actual `pydantic-ai` agent and ensure API keys are set.

from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

# 1. Define the task under test (mocked for a runnable example without API keys)
async def answer_question(prompt: str) -> str:
    if "capital of France" in prompt:
        return "The capital of France is Paris."
    elif "Python programming" in prompt:
        return "Python is named after the British sketch comedy group Monty Python."
    return f"Mock LLM response to: {prompt[:50]}..."

# 2. Define your evaluation logic
@dataclass
class SimpleKeywordEvaluator(Evaluator[str, str]):
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        # For simplicity, check for specific keywords based on the input
        if "capital of France" in ctx.inputs and "Paris" in ctx.output:
            return 1.0
        if "Python programming" in ctx.inputs and "Python" in ctx.output:
            return 1.0
        return 0.0

# 3. Create a Dataset of evaluation cases, attaching the evaluator
dataset = Dataset(
    cases=[
        Case(name="france", inputs="What is the capital of France?"),
        Case(name="python_fact", inputs="Tell me a fun fact about Python programming."),
        Case(name="arithmetic", inputs="What is 2 + 2?"),  # designed to fail the evaluator
    ],
    evaluators=[SimpleKeywordEvaluator()],
)

# 4. Run the evaluation and print a report table
if __name__ == "__main__":
    report = dataset.evaluate_sync(answer_question)
    report.print(include_input=True, include_output=True)
