Ragas
RAG evaluation framework — measures faithfulness, answer relevancy, context precision/recall, and more. Current version: 0.4.3 (Mar 2026), still pre-1.0. v0.2 was a major breaking change from v0.1: metrics are now class instances initialized with an LLM, evaluate() takes an EvaluationDataset rather than a HuggingFace Dataset, answer_relevancy was renamed to ResponseRelevancy, and sample fields were renamed (question→user_input, answer→response, contexts→retrieved_contexts). The legacy API still works but is deprecated and will be removed in v1.0.
Warnings
- breaking v0.2 renamed all field names: question→user_input, answer→response, contexts→retrieved_contexts. Using old field names silently produces empty/wrong evaluations.
- breaking answer_relevancy metric renamed to ResponseRelevancy in v0.2. 'from ragas.metrics import answer_relevancy' still works but is deprecated and will be removed in v1.0.
- breaking evaluate() now takes EvaluationDataset not a HuggingFace Dataset. Passing HuggingFace Dataset directly raises TypeError in v0.2+.
- breaking Metrics must be initialized as class instances with an llm= argument. The old pattern of importing lowercase singletons (faithfulness, answer_relevancy) is deprecated and will be removed in v1.0.
- gotcha All LLM-judge metrics require an async LLM. Ragas uses async internally — synchronous LLM wrappers will cause errors. Use LangchainLLMWrapper or ragas.llms.llm_factory.
- gotcha Context recall (LLMContextRecall) requires a reference (ground truth) field. Running it without reference gives a score of 0 or error.
- gotcha Ragas collects anonymized telemetry by default. Set RAGAS_DO_NOT_TRACK=true to opt out.
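The v0.1→v0.2 field renames in the warnings above can be handled mechanically when migrating stored datasets. A minimal illustrative helper (not part of Ragas — the function name and rename table are assumptions drawn from the renames listed above):

```python
# Map legacy v0.1 field names to their v0.2 equivalents before building
# SingleTurnSample objects. Illustrative helper, not a Ragas API.
V1_TO_V2 = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
}

def migrate_record(record: dict) -> dict:
    """Return a copy of a v0.1-style record using v0.2 field names."""
    return {V1_TO_V2.get(key, key): value for key, value in record.items()}
```

Unknown keys pass through unchanged, so records that already use v0.2 names are left as-is.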
Install
- pip install ragas
Imports
- evaluate (v0.2+ style)
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
samples = [
    SingleTurnSample(
        user_input='When was the first Super Bowl?',
        response='The first Super Bowl was held on Jan 15, 1967.',
        retrieved_contexts=[
            'The First AFL-NFL World Championship Game was played on January 15, 1967.'
        ]
    )
]
dataset = EvaluationDataset(samples=samples)
result = evaluate(
    dataset,
    metrics=[
        Faithfulness(llm=llm),
        ResponseRelevancy(llm=llm)
    ]
)
print(result)
- single metric scoring
from ragas import SingleTurnSample
from ragas.metrics import Faithfulness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
import asyncio

llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
scorer = Faithfulness(llm=llm)
sample = SingleTurnSample(
    user_input='What year was Python created?',
    response='Python was created in 1991.',
    retrieved_contexts=['Python was first released in 1991 by Guido van Rossum.']
)
# Async score
score = asyncio.run(scorer.single_turn_ascore(sample))
print(score)
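When scoring many samples with single_turn_ascore, calling asyncio.run() once per sample wastes the async design; run the coroutines concurrently in one event loop instead. A sketch of the pattern, where fake_score is a stub standing in for scorer.single_turn_ascore (swap in the real scorer in practice):

```python
import asyncio

# fake_score is a stand-in for scorer.single_turn_ascore(sample):
# it simulates an async LLM-judge call and returns a dummy score.
async def fake_score(sample: str) -> float:
    await asyncio.sleep(0)           # placeholder for the network round-trip
    return float(len(sample) > 0)    # dummy score, not a real metric

async def score_all(samples: list[str]) -> list[float]:
    # Launch all scoring coroutines concurrently in one event loop.
    return await asyncio.gather(*(fake_score(s) for s in samples))

scores = asyncio.run(score_all(["sample one", "sample two"]))
```

The same shape works with the real scorer: gather scorer.single_turn_ascore(s) over your SingleTurnSample objects.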
Quickstart
# pip install ragas langchain-openai
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy, LLMContextRecall
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
import os
os.environ['OPENAI_API_KEY'] = 'your-key'
llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
samples = [
    SingleTurnSample(
        user_input='What is the capital of France?',
        response='The capital of France is Paris.',
        retrieved_contexts=['Paris is the capital and most populous city of France.'],
        reference='Paris'  # ground truth — needed for recall
    )
]
dataset = EvaluationDataset(samples=samples)
result = evaluate(
    dataset,
    metrics=[
        Faithfulness(llm=llm),
        ResponseRelevancy(llm=llm),
        LLMContextRecall(llm=llm)
    ]
)
print(result)
# {'faithfulness': 1.0, 'response_relevancy': 0.97, 'context_recall': 1.0}
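Since evaluate() reports per-metric scores like the dict above, a common follow-up is gating CI on them. A minimal sketch with hard-coded scores standing in for the real result (the threshold values are illustrative assumptions, not Ragas defaults):

```python
# Hypothetical CI gate: fail the run if any metric drops below its floor.
# `scores` mirrors the dict printed by the quickstart; in practice, pull
# the values out of the evaluate() result instead of hard-coding them.
THRESHOLDS = {"faithfulness": 0.9, "response_relevancy": 0.8, "context_recall": 0.9}
scores = {"faithfulness": 1.0, "response_relevancy": 0.97, "context_recall": 1.0}

failing = {m: s for m, s in scores.items() if s < THRESHOLDS[m]}
assert not failing, f"metrics below threshold: {failing}"
```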