Ragas
RAG evaluation framework — measures faithfulness, answer relevancy, context precision/recall and more. Current version: 0.4.3 (Mar 2026). Still pre-1.0. v0.2 was a major breaking change from v0.1: metrics are now class instances initialized with LLM, evaluate() takes EvaluationDataset not HuggingFace Dataset, answer_relevancy renamed to ResponseRelevancy, fields renamed (question→user_input, answer→response, contexts→retrieved_contexts). Legacy API still works but deprecated — will be removed in v1.0.
pip install ragas

Common errors
error KeyError: 'question' ↓
cause In ragas v0.2+, the required field names for the evaluation dataset were changed. 'question' became 'user_input', 'answer' became 'response', and 'contexts' became 'retrieved_contexts'.
fix
Update the keys in your dataset dictionary to the new names before creating the EvaluationDataset.
data_samples = {
    'user_input': ['What is RAG?'],
    'response': ['RAG is Retrieval-Augmented Generation.'],
    'retrieved_contexts': [['RAG combines retrieval and generation models.']],
    'reference': ['RAG is a technique to improve LLM outputs.']  # 'ground_truth' was also renamed to 'reference'
}
# Then proceed to create EvaluationDataset
# dataset = EvaluationDataset(data_samples)

error TypeError: Faithfulness.__init__ missing 1 required positional argument: 'llm' ↓
cause In ragas v0.2+, metrics are no longer simple functions but class instances that must be initialized, typically requiring an LLM argument.
fix
Instantiate the metric class with an LLM (typically a LangchainLLMWrapper around a LangChain chat model) during initialization.
from ragas.metrics import Faithfulness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

openai_model = ChatOpenAI(model="gpt-3.5-turbo")
ragas_llm = LangchainLLMWrapper(openai_model)
# Initialize the metric as a class instance with the LLM
faithfulness_metric = Faithfulness(llm=ragas_llm)
# Then pass the instance to evaluate
# result = evaluate(dataset, metrics=[faithfulness_metric])

error NameError: name 'answer_relevancy' is not defined ↓
cause The `answer_relevancy` metric was renamed to `ResponseRelevancy` in ragas v0.2+.
fix
Use ResponseRelevancy instead of answer_relevancy and import it as a class.
from ragas.metrics import ResponseRelevancy
# Then instantiate it with an LLM as shown above:
# response_relevancy_metric = ResponseRelevancy(llm=ragas_llm)

error ModuleNotFoundError: No module named 'ragas' ↓
cause The `ragas` library is not installed in your current Python environment, or the environment is not activated.
fix
Install ragas using pip.
pip install ragas
# For an OpenAI-backed judge LLM via LangChain (used in the examples below):
pip install ragas langchain-openai

Warnings
breaking v0.2 renamed all field names: question→user_input, answer→response, contexts→retrieved_contexts. Using old field names silently produces empty/wrong evaluations. ↓
fix SingleTurnSample(user_input=..., response=..., retrieved_contexts=[...])
breaking answer_relevancy metric renamed to ResponseRelevancy in v0.2. 'from ragas.metrics import answer_relevancy' still works but is deprecated and will be removed in v1.0. ↓
fix from ragas.metrics import ResponseRelevancy; ResponseRelevancy(llm=llm)
breaking evaluate() now takes EvaluationDataset not a HuggingFace Dataset. Passing HuggingFace Dataset directly raises TypeError in v0.2+. ↓
fix eval_dataset = EvaluationDataset.from_hf_dataset(hf_dataset) then evaluate(eval_dataset, ...)
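A sketch of that conversion path, assuming a legacy HuggingFace Dataset whose columns still carry the v0.1 names (rename the columns first, then convert):

from datasets import Dataset
from ragas import EvaluationDataset

hf_dataset = Dataset.from_dict({
    'question': ['What is RAG?'],
    'answer': ['RAG is Retrieval-Augmented Generation.'],
    'contexts': [['RAG combines retrieval and generation.']],
})
# Map the v0.1 column names onto the v0.2 schema before converting
hf_dataset = hf_dataset.rename_columns({
    'question': 'user_input',
    'answer': 'response',
    'contexts': 'retrieved_contexts',
})
eval_dataset = EvaluationDataset.from_hf_dataset(hf_dataset)
# then: evaluate(eval_dataset, metrics=[...])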
breaking Metrics must be initialized as class instances with llm= argument. Old pattern of using lowercase singleton (faithfulness, answer_relevancy) deprecated — will be removed in v1.0. ↓
fix Faithfulness(llm=llm) not faithfulness. Pass LLM explicitly to each metric.
gotcha All LLM-judge metrics require an async LLM. Ragas uses async internally — synchronous LLM wrappers will cause errors. Use LangchainLLMWrapper or ragas.llms.llm_factory. ↓
fix from ragas.llms import LangchainLLMWrapper; llm = LangchainLLMWrapper(ChatOpenAI(...))
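As an alternative to wrapping a LangChain model yourself, ragas also exposes llm_factory; a minimal sketch, assuming the default OpenAI backend and an OPENAI_API_KEY in the environment (the model name is only an example):

from ragas.llms import llm_factory

llm = llm_factory(model='gpt-4o-mini')  # returns an async-capable Ragas LLM wrapper
# metrics can then be built as Faithfulness(llm=llm), ResponseRelevancy(llm=llm), ...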
gotcha Context recall (LLMContextRecall) requires a reference (ground truth) field. Running it without reference gives a score of 0 or error. ↓
fix Include reference='ground truth answer' in SingleTurnSample for recall metrics.
gotcha Ragas collects anonymized telemetry by default. Set RAGAS_DO_NOT_TRACK=true to opt out. ↓
fix export RAGAS_DO_NOT_TRACK=true
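The same opt-out can be set from Python instead of the shell; a small sketch (set the variable before any ragas code runs so the telemetry check picks it up):

import os
os.environ['RAGAS_DO_NOT_TRACK'] = 'true'  # disable anonymized usage tracking
import ragas  # import after setting the flag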
breaking On Python 3.9, importing `ragas` (or dependencies such as `instructor`) can fail with `TypeError: unsupported operand type(s) for |: 'type' and 'type'`. Those dependencies use the Python 3.10+ union syntax (`TypeA | TypeB`) in annotations that are evaluated at runtime, which raises a TypeError on 3.9 unless `from __future__ import annotations` defers the evaluation or the `eval_type_backport` package is available. ↓
fix Upgrade to Python 3.10 or newer. If you must stay on 3.9, installing `eval_type_backport` (`pip install eval_type_backport`) may resolve it, depending on whether the offending library supports that fallback.
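For reference, the failing construct can be reproduced in a plain Python 3.9 session with no ragas involved; a minimal sketch:

# Raises on Python 3.9: TypeError: unsupported operand type(s) for |: 'type' and 'type'
# On 3.10+ the same expression evaluates to a types.UnionType, so newer interpreters are unaffected
MaybeStr = str | None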
breaking Dependencies with C/C++/Cython extensions (e.g. scikit-network) may fail to build on minimal Docker images such as alpine because the build toolchain (g++, make) is missing. ↓
fix Install build essentials in your Dockerfile (e.g., for Alpine: 'apk add build-base').
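A sketch of that fix as shell steps for an Alpine-based image (package set per the fix above; the slim/glibc images in the compatibility table below avoid the problem entirely):

# e.g. inside a Dockerfile RUN step, before installing Python packages
apk add --no-cache build-base   # gcc, g++, make and libc headers
pip install ragas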
Install compatibility (stale; last tested 2026-05-12)
python | os / libc     | status | wheel | install | import | disk
3.10   | alpine (musl) | -      | -     | -       | -      | -
3.10   | slim (glibc)  | -      | -     | -       | 5.82s  | 704M
3.11   | alpine (musl) | -      | -     | -       | -      | -
3.11   | slim (glibc)  | -      | -     | -       | 8.00s  | 751M
3.12   | alpine (musl) | -      | -     | -       | -      | -
3.12   | slim (glibc)  | -      | -     | -       | 8.54s  | 727M
3.13   | alpine (musl) | -      | -     | -       | -      | -
3.13   | slim (glibc)  | -      | -     | -       | 8.16s  | 725M
3.9    | alpine (musl) | -      | -     | -       | -      | -
3.9    | slim (glibc)  | -      | -     | -       | -      | -
Imports
- evaluate (v0.2+ style)

wrong
# v0.1 style — deprecated, removed in v1.0
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = {
    'question': ['When was the first Super Bowl?'],
    'answer': ['Jan 15, 1967'],
    'contexts': [['The game was played on January 15, 1967.']]
}
ds = Dataset.from_dict(data)
result = evaluate(ds, metrics=[faithfulness, answer_relevancy])

correct
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
samples = [
    SingleTurnSample(
        user_input='When was the first Super Bowl?',
        response='The first Super Bowl was held on Jan 15, 1967.',
        retrieved_contexts=[
            'The First AFL-NFL World Championship Game was played on January 15, 1967.'
        ]
    )
]
dataset = EvaluationDataset(samples=samples)
result = evaluate(
    dataset,
    metrics=[
        Faithfulness(llm=llm),
        ResponseRelevancy(llm=llm)
    ]
)
print(result)

- single metric scoring

wrong
from ragas.metrics import faithfulness  # lowercase — deprecated singleton
from ragas import evaluate
# Using the deprecated singleton instance

correct
from ragas import SingleTurnSample
from ragas.metrics import Faithfulness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
import asyncio

llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
scorer = Faithfulness(llm=llm)
sample = SingleTurnSample(
    user_input='What year was Python created?',
    response='Python was created in 1991.',
    retrieved_contexts=['Python was first released in 1991 by Guido van Rossum.']
)
# Async score
score = asyncio.run(scorer.single_turn_ascore(sample))
print(score)
Quickstart (stale; last tested 2026-04-23)
# pip install ragas langchain-openai
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy, LLMContextRecall
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
import os
os.environ['OPENAI_API_KEY'] = 'your-key'
llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
samples = [
    SingleTurnSample(
        user_input='What is the capital of France?',
        response='The capital of France is Paris.',
        retrieved_contexts=['Paris is the capital and most populous city of France.'],
        reference='Paris'  # ground truth — needed for recall
    )
]
dataset = EvaluationDataset(samples=samples)
result = evaluate(
    dataset,
    metrics=[
        Faithfulness(llm=llm),
        ResponseRelevancy(llm=llm),
        LLMContextRecall(llm=llm)
    ]
)
print(result)
# {'faithfulness': 1.0, 'response_relevancy': 0.97, 'context_recall': 1.0}
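Beyond the aggregate scores, per-sample results can be inspected as a DataFrame; a small sketch continuing the run above (pandas is already pulled in by ragas' dependencies):

# One row per SingleTurnSample, one column per metric plus the sample fields
df = result.to_pandas()
print(df.head())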