Ragas
RAG evaluation framework — measures faithfulness, answer relevancy, context precision/recall, and more. Current version: 0.4.3 (Mar 2026), still pre-1.0. v0.2 was a major breaking change from v0.1: metrics are now class instances initialized with an LLM, evaluate() takes an EvaluationDataset rather than a HuggingFace Dataset, answer_relevancy was renamed to ResponseRelevancy, and sample fields were renamed (question→user_input, answer→response, contexts→retrieved_contexts). The legacy API still works but is deprecated and will be removed in v1.0.
Warnings
- breaking v0.2 renamed all field names: question→user_input, answer→response, contexts→retrieved_contexts. Using old field names silently produces empty/wrong evaluations.
- breaking answer_relevancy metric renamed to ResponseRelevancy in v0.2. 'from ragas.metrics import answer_relevancy' still works but is deprecated and will be removed in v1.0.
- breaking evaluate() now takes EvaluationDataset not a HuggingFace Dataset. Passing HuggingFace Dataset directly raises TypeError in v0.2+.
- breaking Metrics must be initialized as class instances with an llm= argument. The old pattern of importing lowercase singletons (faithfulness, answer_relevancy) is deprecated and will be removed in v1.0.
- gotcha All LLM-judge metrics require an async LLM. Ragas uses async internally — synchronous LLM wrappers will cause errors. Use LangchainLLMWrapper or ragas.llms.llm_factory.
- gotcha Context recall (LLMContextRecall) requires a reference (ground truth) field. Running it without reference gives a score of 0 or error.
- gotcha Ragas collects anonymized telemetry by default. Set RAGAS_DO_NOT_TRACK=true to opt out.
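The v0.1→v0.2 field renames in the warnings above can be handled mechanically when migrating stored datasets. A minimal illustrative helper (not part of Ragas — the function name and rename table are assumptions drawn from the renames listed above):

```python
# Map legacy v0.1 field names to their v0.2 equivalents before building
# SingleTurnSample objects. Illustrative helper, not a Ragas API.
V1_TO_V2 = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
}

def migrate_record(record: dict) -> dict:
    """Return a copy of a v0.1-style record using v0.2 field names."""
    return {V1_TO_V2.get(key, key): value for key, value in record.items()}
```

Unknown keys pass through unchanged, so records that already use v0.2 names are left as-is.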
Install
- pip install ragas
Imports
- evaluate (v0.2+ style)
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI

llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
samples = [
    SingleTurnSample(
        user_input='When was the first Super Bowl?',
        response='The first Super Bowl was held on Jan 15, 1967.',
        retrieved_contexts=[
            'The First AFL-NFL World Championship Game was played on January 15, 1967.'
        ]
    )
]
dataset = EvaluationDataset(samples=samples)
result = evaluate(
    dataset,
    metrics=[
        Faithfulness(llm=llm),
        ResponseRelevancy(llm=llm)
    ]
)
print(result)
- single metric scoring
from ragas import SingleTurnSample
from ragas.metrics import Faithfulness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
import asyncio

llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
scorer = Faithfulness(llm=llm)
sample = SingleTurnSample(
    user_input='What year was Python created?',
    response='Python was created in 1991.',
    retrieved_contexts=['Python was first released in 1991 by Guido van Rossum.']
)
# Async score
score = asyncio.run(scorer.single_turn_ascore(sample))
print(score)
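When scoring many samples with single_turn_ascore, calling asyncio.run() once per sample wastes the async design; run the coroutines concurrently in one event loop instead. A sketch of the pattern, where fake_score is a stub standing in for scorer.single_turn_ascore (swap in the real scorer in practice):

```python
import asyncio

# fake_score is a stand-in for scorer.single_turn_ascore(sample):
# it simulates an async LLM-judge call and returns a dummy score.
async def fake_score(sample: str) -> float:
    await asyncio.sleep(0)           # placeholder for the network round-trip
    return float(len(sample) > 0)    # dummy score, not a real metric

async def score_all(samples: list[str]) -> list[float]:
    # Launch all scoring coroutines concurrently in one event loop.
    return await asyncio.gather(*(fake_score(s) for s in samples))

scores = asyncio.run(score_all(["sample one", "sample two"]))
```

The same shape works with the real scorer: gather scorer.single_turn_ascore(s) over your SingleTurnSample objects.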
Quickstart
# pip install ragas langchain-openai
from ragas import EvaluationDataset, SingleTurnSample, evaluate
from ragas.metrics import Faithfulness, ResponseRelevancy, LLMContextRecall
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI
import os
os.environ['OPENAI_API_KEY'] = 'your-key'
llm = LangchainLLMWrapper(ChatOpenAI(model='gpt-4o-mini'))
samples = [
    SingleTurnSample(
        user_input='What is the capital of France?',
        response='The capital of France is Paris.',
        retrieved_contexts=['Paris is the capital and most populous city of France.'],
        reference='Paris'  # ground truth — needed for recall
    )
]
dataset = EvaluationDataset(samples=samples)
result = evaluate(
    dataset,
    metrics=[
        Faithfulness(llm=llm),
        ResponseRelevancy(llm=llm),
        LLMContextRecall(llm=llm)
    ]
)
print(result)
# {'faithfulness': 1.0, 'response_relevancy': 0.97, 'context_recall': 1.0}
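Since evaluate() reports per-metric scores like the dict above, a common follow-up is gating CI on them. A minimal sketch with hard-coded scores standing in for the real result (the threshold values are illustrative assumptions, not Ragas defaults):

```python
# Hypothetical CI gate: fail the run if any metric drops below its floor.
# `scores` mirrors the dict printed by the quickstart; in practice, pull
# the values out of the evaluate() result instead of hard-coding them.
THRESHOLDS = {"faithfulness": 0.9, "response_relevancy": 0.8, "context_recall": 0.9}
scores = {"faithfulness": 1.0, "response_relevancy": 0.97, "context_recall": 1.0}

failing = {m: s for m, s in scores.items() if s < THRESHOLDS[m]}
assert not failing, f"metrics below threshold: {failing}"
```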