pytrec-eval-terrier

0.5.10 · active · verified Mon Apr 13

pytrec-eval-terrier provides Python bindings for common Information Retrieval evaluation measures by wrapping the `trec_eval` C library. It makes it straightforward to evaluate the ranking performance of search systems directly from Python. The current version is 0.5.10; releases occur periodically, typically to add support for new Python versions or to fix minor bugs.

Warnings

Install
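
The package is distributed on PyPI under the same name as shown in the title; a standard installation is:

pip install pytrec-eval-terrier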

Imports
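
Although the distribution is named pytrec-eval-terrier, it is imported under the module name used throughout the quickstart below:

import pytrec_eval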

Quickstart

This quickstart demonstrates how to use `pytrec_eval` to evaluate a set of runs against relevance judgments. It shows how to prepare QRELs and runs as dictionaries, select evaluation measures, instantiate the `RelevanceEvaluator`, and compute both per-query and aggregated results.

import pytrec_eval

# Example QRELS (Query Relevance Judgments)
# Format: {query_id: {doc_id: relevance_score}}
qrels = {
    'q1': {'d1': 1, 'd2': 0, 'd3': 1},
    'q2': {'d4': 1, 'd5': 0}
}

# Example RUNS (System Rankings)
# Format: {query_id: {doc_id: score}}
runs = {
    'q1': {'d1': 0.9, 'd2': 0.8, 'd4': 0.7},
    'q2': {'d4': 0.95, 'd6': 0.85}
}

# Define measures to evaluate
measures = pytrec_eval.supported_measures
# Or a specific set: measures = {'map', 'ndcg_cut.10', 'recip_rank'}

# Instantiate the evaluator
evaluator = pytrec_eval.RelevanceEvaluator(qrels, measures)

# Evaluate the runs
results = evaluator.evaluate(runs)

# Print results for a specific query and measure
print(f"MAP for q1: {results['q1']['map']:.4f}")
print(f"NDCG@10 for q2: {results['q2']['ndcg_cut_10']:.4f}")

# Compute the macro-average of a measure across all queries
avg_map = pytrec_eval.compute_aggregated_measure(
    'map', [query_results['map'] for query_results in results.values()])
print(f"Average MAP: {avg_map:.4f}")
