DeepEval
DeepEval is an LLM evaluation framework that helps developers evaluate any LLM workflow, from simple prompt chains to complex multi-step agents. It provides a suite of metrics for various evaluation aspects like relevancy, faithfulness, hallucination, and agentic task completion. Currently at version 3.9.6, the library maintains a frequent release cadence, often introducing new metrics, test case types, and developer experience improvements.
Warnings
- breaking Breaking change in v3.0.8: Conversational test cases must now use a `list[Turn]` instead of `list[LLMTestCase]`.
- breaking Major API overhaul in v3.0: Significant changes for defining complex LLM workflows and agents.
- gotcha DeepEval provides multiple `TestCase` types (`LLMTestCase`, `MLLMTestCase`, `ArenaTestCase`, `Turn`). Using the incorrect `TestCase` type for a specific evaluation scenario (e.g., `LLMTestCase` for multi-turn conversations after v3.0.8) is a common error.
- gotcha Most DeepEval metrics rely on an underlying Large Language Model (LLM) for their evaluation logic, requiring an API key (e.g., `OPENAI_API_KEY`, `COHERE_API_KEY`) to be set.
Install
-
pip install deepeval
Imports
- evaluate
from deepeval import evaluate
- LLMTestCase
from deepeval.test_case import LLMTestCase
- Turn
from deepeval.test_case import Turn
- AnswerRelevancyMetric
from deepeval.metrics import AnswerRelevancyMetric
Quickstart
import os
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric

# Configure your LLM API key (e.g., OpenAI, Cohere, etc.)
# Most metrics require an LLM to run. Set this as an environment variable:
# os.environ["OPENAI_API_KEY"] = "your_openai_api_key_here"

# Define a simple LLM test case
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
    expected_output="Paris",
    context=["France is a country in Western Europe. Its capital is Paris."],
    retrieval_context=["Paris is known for the Eiffel Tower."],
)

# Initialize a metric, e.g., AnswerRelevancyMetric.
# Some metrics accept additional parameters or a specific LLM model.
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)

# Run the evaluation. evaluate() is synchronous (it handles async
# execution internally) and returns an EvaluationResult.
results = evaluate(test_cases=[test_case], metrics=[answer_relevancy_metric])

print("Evaluation Results:")
for test_result in results.test_results:
    print(f"  Input: {test_result.input}")
    print(f"  Actual Output: {test_result.actual_output}")
    # Attribute names below follow recent v3.x releases; they may
    # differ slightly in older versions.
    for metric_data in test_result.metrics_data:
        print(f"    Metric: {metric_data.name}, Score: {metric_data.score}, Pass: {metric_data.success}")