IR Datasets
ir_datasets provides a common interface to many Information Retrieval (IR) ad-hoc ranking benchmarks, training datasets, and more. It handles downloading, extracting, and exposing a unified iterator format for a wide range of IR datasets. The library is actively maintained (version 0.5.11 at the time of writing), with new datasets and bug fixes released regularly.
Warnings
- gotcha Converting iterators to dictionaries (e.g., `dataset.qrels_dict()`) will load the entire dataset's relevance judgments into memory. For very large datasets, this can lead to high memory consumption and potential crashes. Use iterators (`dataset.qrels_iter()`) for memory efficiency with large collections.
- gotcha ir_datasets includes a 'Beta Python API' which offers alternative access patterns (e.g., `dataset.docs` as an iterable object with slicing). This API is experimental, may contain bugs, and is subject to breaking changes in future versions.
- gotcha Some datasets are not publicly available and require manual steps (e.g., data usage agreements, local file paths) to access. `ir_datasets` will provide instructions on how to obtain these datasets, but it cannot automate their download in all cases.
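The memory trade-off behind the first gotcha can be sketched in plain Python. This is an illustrative example, not the ir_datasets API: `TrecQrel` mimics the namedtuple records that `qrels_iter()` yields, and `simulated_qrels_iter` is a hypothetical stand-in for `dataset.qrels_iter()`.

```python
from collections import namedtuple, defaultdict

# Mimics the record type yielded by qrels_iter() (illustrative, not the real class).
TrecQrel = namedtuple('TrecQrel', ['query_id', 'doc_id', 'relevance', 'iteration'])

def simulated_qrels_iter():
    # Hypothetical stand-in for dataset.qrels_iter(); yields records lazily.
    yield TrecQrel('q1', 'd1', 1, '0')
    yield TrecQrel('q1', 'd2', 0, '0')
    yield TrecQrel('q2', 'd3', 2, '0')

# Memory-hungry: materialises every judgment at once, like qrels_dict().
qrels_dict = defaultdict(dict)
for qrel in simulated_qrels_iter():
    qrels_dict[qrel.query_id][qrel.doc_id] = qrel.relevance

# Memory-light: stream the records and keep only the aggregate you need.
relevant_counts = defaultdict(int)
for qrel in simulated_qrels_iter():
    if qrel.relevance > 0:
        relevant_counts[qrel.query_id] += 1

print(dict(relevant_counts))  # {'q1': 1, 'q2': 1}
```

For a large collection, only the streaming version keeps memory bounded: each record is discarded as soon as it has been aggregated.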
Install
-
pip install ir-datasets
Imports
- ir_datasets
import ir_datasets
Quickstart
import ir_datasets
# Load a dataset, e.g., MS-MARCO passage ranking training set
dataset = ir_datasets.load('msmarco-passage/train')
# Iterate through documents
print("First 3 documents:")
for i, doc in enumerate(dataset.docs_iter()):
    print(f"  Doc ID: {doc.doc_id}, Text: {doc.text[:70]}...")
    if i >= 2:
        break
# Iterate through queries
print("\nFirst 3 queries:")
for i, query in enumerate(dataset.queries_iter()):
    print(f"  Query ID: {query.query_id}, Text: {query.text}")
    if i >= 2:
        break
# Access relevance judgments (qrels)
print("\nFirst 3 qrels:")
for i, qrel in enumerate(dataset.qrels_iter()):
    print(f"  Query ID: {qrel.query_id}, Doc ID: {qrel.doc_id}, Relevance: {qrel.relevance}")
    if i >= 2:
        break
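A common pattern in practice: queries and qrels are usually small enough to load into a dict for fast lookup, while the document collection should always be streamed. The sketch below uses simulated records (`GenericQuery` and `TrecQrel` are illustrative namedtuples mimicking what `queries_iter()` and `qrels_iter()` yield, not the library's own classes).

```python
from collections import namedtuple

# Illustrative record types, mimicking the namedtuples the iterators yield.
GenericQuery = namedtuple('GenericQuery', ['query_id', 'text'])
TrecQrel = namedtuple('TrecQrel', ['query_id', 'doc_id', 'relevance', 'iteration'])

# Simulated data standing in for queries_iter() / qrels_iter().
queries = [GenericQuery('q1', 'what is ir'), GenericQuery('q2', 'ranking models')]
qrels = [TrecQrel('q1', 'd9', 1, '0'), TrecQrel('q2', 'd4', 2, '0')]

# Small collection: a dict keyed by query_id is cheap and convenient.
query_text = {q.query_id: q.text for q in queries}

# Larger collection: stream the records and join against the small dict.
for qrel in qrels:
    print(f"{query_text[qrel.query_id]!r} -> {qrel.doc_id} (rel={qrel.relevance})")
```

The same join works unchanged against real `queries_iter()` and `qrels_iter()` output, since both yield records with `query_id` fields.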