IR Datasets

0.5.11 · active · verified Tue Apr 14

ir_datasets provides a common interface to many Information Retrieval (IR) ad-hoc ranking benchmarks, training datasets, and more. It handles downloading and extracting each corpus and exposes its documents, queries, and relevance judgments through a unified iterator interface. The library is actively maintained, currently at version 0.5.11, with new datasets and bug fixes released regularly.
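The records yielded by those iterators are plain namedtuples. The sketch below uses stand-in namedtuples that mirror the field names of the library's generic record types (the exact shape varies by dataset, so treat these field lists as an assumption and check each dataset's documentation for the real fields):

```python
from collections import namedtuple

# Stand-ins mirroring the record shapes ir_datasets yields for
# msmarco-passage-style datasets. The field names here are an
# assumption for illustration; real datasets may add extra fields.
GenericDoc = namedtuple("GenericDoc", ["doc_id", "text"])
GenericQuery = namedtuple("GenericQuery", ["query_id", "text"])
TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

doc = GenericDoc(doc_id="0", text="example passage text")
query = GenericQuery(query_id="1", text="example query")
qrel = TrecQrel(query_id="1", doc_id="0", relevance=1, iteration="0")

# Fields are accessible by attribute or by position, like any namedtuple.
print(doc.doc_id, query.text, qrel.relevance)
```

Because the records are namedtuples, they unpack cleanly into tuples and work directly with code that expects positional fields.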

Install
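
ir_datasets is distributed on PyPI, so a typical install is:

```shell
pip install ir_datasets
```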

Imports

Quickstart

This quickstart loads a dataset (here, MS MARCO passage ranking) with `ir_datasets.load()` and iterates through its documents, queries, and relevance judgments (qrels) using the dataset's iterator methods.

```python
import ir_datasets

# Load a dataset, e.g. the MS MARCO passage ranking training set
dataset = ir_datasets.load('msmarco-passage/train')

# Iterate through documents
print("First 3 documents:")
for i, doc in enumerate(dataset.docs_iter()):
    print(f"  Doc ID: {doc.doc_id}, Text: {doc.text[:70]}...")
    if i >= 2: break

# Iterate through queries
print("\nFirst 3 queries:")
for i, query in enumerate(dataset.queries_iter()):
    print(f"  Query ID: {query.query_id}, Text: {query.text}")
    if i >= 2: break

# Access relevance judgments (qrels)
print("\nFirst 3 qrels:")
for i, qrel in enumerate(dataset.qrels_iter()):
    print(f"  Query ID: {qrel.query_id}, Doc ID: {qrel.doc_id}, Relevance: {qrel.relevance}")
    if i >= 2: break
```
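
A common next step is grouping the qrels into a nested query-to-document relevance mapping, the shape most IR evaluation libraries expect. A sketch with stand-in tuples (with the real library you would iterate `dataset.qrels_iter()` instead):

```python
from collections import namedtuple

TrecQrel = namedtuple("TrecQrel", ["query_id", "doc_id", "relevance", "iteration"])

# Stand-in data for illustration; substitute dataset.qrels_iter().
qrels_iter = [
    TrecQrel("q1", "d1", 1, "0"),
    TrecQrel("q1", "d2", 0, "0"),
    TrecQrel("q2", "d3", 2, "0"),
]

# Group into {query_id: {doc_id: relevance}}.
qrels = {}
for qrel in qrels_iter:
    qrels.setdefault(qrel.query_id, {})[qrel.doc_id] = qrel.relevance

print(qrels)  # {'q1': {'d1': 1, 'd2': 0}, 'q2': {'d3': 2}}
```

The nested-dict shape makes per-query lookups cheap and matches what tools such as pytrec_eval consume.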