Ragas

0.4.3 · verified Tue May 12 · auth: no · python install: stale · quickstart: stale

RAG evaluation framework — measures faithfulness, answer relevancy, context precision/recall and more. Current version: 0.4.3 (Mar 2026). Still pre-1.0. v0.2 was a major breaking change from v0.1: metrics are now class instances initialized with LLM, evaluate() takes EvaluationDataset not HuggingFace Dataset, answer_relevancy renamed to ResponseRelevancy, fields renamed (question→user_input, answer→response, contexts→retrieved_contexts). Legacy API still works but deprecated — will be removed in v1.0.
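A minimal before/after sketch of that v0.1 → v0.2 shift (legacy lines shown as comments; treat exact names as per your installed version):

# v0.1 legacy style (deprecated, removed in v1.0): function-style metrics and a
# HuggingFace Dataset with question/answer/contexts columns:
#   from ragas import evaluate
#   from ragas.metrics import faithfulness, answer_relevancy
#   result = evaluate(hf_dataset, metrics=[faithfulness, answer_relevancy])

# v0.2+ style: class-based metrics initialized with an LLM, and an
# EvaluationDataset built from SingleTurnSample objects:
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import Faithfulness

sample = SingleTurnSample(
    user_input='What is RAG?',
    response='RAG is Retrieval-Augmented Generation.',
    retrieved_contexts=['RAG combines retrieval and generation.']
)
dataset = EvaluationDataset(samples=[sample])
# result = evaluate(dataset, metrics=[Faithfulness(llm=my_llm)])  # my_llm: a wrapped LLM, see below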

pip install ragas
error KeyError: 'question'
cause In ragas v0.2+, the required field names for the evaluation dataset were changed. 'question' became 'user_input', 'answer' became 'response', and 'contexts' became 'retrieved_contexts'.
fix
Rename the fields to the new names (the ground-truth field is now 'reference') and build the dataset from SingleTurnSample objects.
from ragas import EvaluationDataset, SingleTurnSample

samples = [
    SingleTurnSample(
        user_input='What is RAG?',
        response='RAG is Retrieval-Augmented Generation.',
        retrieved_contexts=['RAG combines retrieval and generation models.'],
        reference='RAG is a technique to improve LLM outputs.'
    )
]
dataset = EvaluationDataset(samples=samples)
error TypeError: Faithfulness.__init__ missing 1 required positional argument: 'llm'
cause In ragas v0.2+, metrics are no longer simple functions but class instances that must be initialized, typically requiring an LLM argument.
fix
Instantiate the metric class with an LLM wrapper (e.g., LangchainLLMWrapper around a LangChain chat model) during initialization.
from ragas.metrics import Faithfulness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

openai_model = ChatOpenAI(model="gpt-4o-mini")
ragas_llm = LangchainLLMWrapper(openai_model)

# Initialize the metric as a class instance with the LLM
faithfulness_metric = Faithfulness(llm=ragas_llm)

# Then pass the instance to evaluate
# result = evaluate(dataset, metrics=[faithfulness_metric])
error NameError: name 'answer_relevancy' is not defined
cause The `answer_relevancy` metric was renamed to `ResponseRelevancy` in ragas v0.2+.
fix
Use ResponseRelevancy instead of answer_relevancy and import it as a class.
from ragas.metrics import ResponseRelevancy
# Then instantiate it with an LLM as shown above:
# response_relevancy_metric = ResponseRelevancy(llm=ragas_llm)
error ModuleNotFoundError: No module named 'ragas'
cause The `ragas` library is not installed in your current Python environment, or the environment is not activated.
fix
Install ragas using pip.
pip install ragas
# For the LangChain + OpenAI setup used in the examples below:
pip install ragas langchain-openai
breaking v0.2 renamed all field names: question→user_input, answer→response, contexts→retrieved_contexts. Using old field names silently produces empty/wrong evaluations.
fix SingleTurnSample(user_input=..., response=..., retrieved_contexts=[...])
breaking answer_relevancy metric renamed to ResponseRelevancy in v0.2. 'from ragas.metrics import answer_relevancy' still works but is deprecated and will be removed in v1.0.
fix from ragas.metrics import ResponseRelevancy; ResponseRelevancy(llm=llm)
breaking evaluate() now takes EvaluationDataset not a HuggingFace Dataset. Passing HuggingFace Dataset directly raises TypeError in v0.2+.
fix eval_dataset = EvaluationDataset.from_hf_dataset(hf_dataset) then evaluate(eval_dataset, ...)
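For example (a sketch: 'eval_samples.json' is a hypothetical file whose columns already use the v0.2 names; from_hf_dataset is the conversion path named in the fix above):

from datasets import load_dataset
from ragas import EvaluationDataset, evaluate

hf_dataset = load_dataset('json', data_files='eval_samples.json', split='train')
# Convert first; passing hf_dataset straight to evaluate() raises a TypeError in v0.2+
eval_dataset = EvaluationDataset.from_hf_dataset(hf_dataset)
# result = evaluate(eval_dataset, metrics=[...])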
breaking Metrics must be initialized as class instances with llm= argument. Old pattern of using lowercase singleton (faithfulness, answer_relevancy) deprecated — will be removed in v1.0.
fix Faithfulness(llm=llm) not faithfulness. Pass LLM explicitly to each metric.
gotcha All LLM-judge metrics require an async LLM. Ragas uses async internally — synchronous LLM wrappers will cause errors. Use LangchainLLMWrapper or ragas.llms.llm_factory.
fix from ragas.llms import LangchainLLMWrapper; llm = LangchainLLMWrapper(ChatOpenAI(...))
gotcha Context recall (LLMContextRecall) requires a reference (ground truth) field. Running it without reference gives a score of 0 or error.
fix Include reference='ground truth answer' in SingleTurnSample for recall metrics.
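A hedged sketch of a sample LLMContextRecall can score (the reference field supplies the ground truth the retrieved contexts are checked against):

from ragas import SingleTurnSample

sample = SingleTurnSample(
    user_input='Who wrote Hamlet?',
    response='Hamlet was written by William Shakespeare.',
    retrieved_contexts=['Hamlet is a tragedy written by William Shakespeare.'],
    reference='William Shakespeare wrote Hamlet.'  # required by recall metrics
)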
gotcha Ragas collects anonymized telemetry by default. Set RAGAS_DO_NOT_TRACK=true to opt out.
fix export RAGAS_DO_NOT_TRACK=true
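The same opt-out can be set from Python; safest is to set it before ragas is imported (a sketch):

import os
os.environ['RAGAS_DO_NOT_TRACK'] = 'true'  # set before importing ragas
import ragas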
breaking On Python 3.9, importing `ragas` (or dependencies such as `instructor`) can raise `TypeError: unsupported operand type(s) for |: 'type' and 'type'`. Dependencies use the Python 3.10+ union syntax (`TypeA | TypeB`) in annotations that are evaluated at import time, which fails on 3.9 without `from __future__ import annotations` or the `eval_type_backport` package.
fix Upgrade to Python 3.10 or newer. If you must stay on 3.9, installing `eval_type_backport` (`pip install eval_type_backport`) may resolve it, depending on how the dependency evaluates its annotations.
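A minimal reproduction of the underlying 3.9 issue, independent of ragas (assumed illustration, not ragas code):

# On Python 3.9 this class body raises:
#   TypeError: unsupported operand type(s) for |: 'type' and 'type'
# because the annotation int | float is evaluated at class-creation time
# and PEP 604 unions need Python 3.10+.
class Config:
    timeout: int | float = 5.0

# In your own code: quote the annotation ('int | float'), use
# typing.Union[int, float], or add 'from __future__ import annotations'.
# For third-party code you cannot edit, upgrade Python or try eval_type_backport.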
breaking Dependencies with C/C++/Cython extensions (e.g., scikit-network) may fail to build on minimal Docker images such as alpine because build tools (g++, make) are missing.
fix Install build essentials in your Dockerfile (e.g., for Alpine: 'apk add build-base').
python  os / libc      status wheel install import disk
3.9     alpine (musl)  -  -  -  -
3.9     slim (glibc)   -  -  -  -
3.10    alpine (musl)  -  -  -  -
3.10    slim (glibc)   -  -  5.82s  704M
3.11    alpine (musl)  -  -  -  -
3.11    slim (glibc)   -  -  8.00s  751M
3.12    alpine (musl)  -  -  -  -
3.12    slim (glibc)   -  -  8.54s  727M
3.13    alpine (musl)  -  -  -  -
3.13    slim (glibc)   -  -  8.16s  725M

Ragas v0.2+ RAG evaluation with EvaluationDataset and class-based metrics.

# pip install ragas langchain-openai
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy, LLMContextRecall
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
import os

os.environ['OPENAI_API_KEY'] = 'your-key'

llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))

samples = [
    SingleTurnSample(
        user_input='What is the capital of France?',
        response='The capital of France is Paris.',
        retrieved_contexts=['Paris is the capital and most populous city of France.'],
        reference='Paris'  # ground truth — needed for recall
    )
]

dataset = EvaluationDataset(samples=samples)

result = evaluate(
    dataset,
    metrics=[
        Faithfulness(llm=llm),
        ResponseRelevancy(llm=llm),
        LLMContextRecall(llm=llm)
    ]
)
print(result)
# Example output (exact keys and scores vary by version), e.g.
# {'faithfulness': 1.0, 'answer_relevancy': 0.97, 'context_recall': 1.0}
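For per-sample scores, the result can typically be converted to a DataFrame (to_pandas is the usual inspection path; exact columns depend on the metrics run):

# Per-sample breakdown: one row per sample, one column per metric
df = result.to_pandas()
print(df.head())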