{"id":5775,"library":"ir-datasets","title":"IR Datasets","description":"ir_datasets provides a common interface to many Information Retrieval (IR) ad-hoc ranking benchmarks, training datasets, and more. It handles downloading, extracting, and providing a unified iterator format for various IR datasets. The library is actively maintained, currently at version 0.5.11, with new datasets and bug fixes released regularly.","status":"active","version":"0.5.11","language":"en","source_language":"en","source_url":"https://github.com/allenai/ir_datasets","tags":["information retrieval","datasets","benchmarks","NLP"],"install":[{"cmd":"pip install ir-datasets","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Commonly used for evaluating IR experiments; integrates directly with ir_datasets qrels.","package":"ir-measures","optional":true},{"reason":"Popular IR experimentation toolkit that integrates with ir_datasets.","package":"PyTerrier","optional":true}],"imports":[{"symbol":"ir_datasets","correct":"import ir_datasets"}],"quickstart":{"code":"import ir_datasets\n\n# Load a dataset, e.g., the MS MARCO passage ranking training set\ndataset = ir_datasets.load('msmarco-passage/train')\n\n# Iterate through documents\nprint(\"First 3 documents:\")\nfor i, doc in enumerate(dataset.docs_iter()):\n    print(f\"  Doc ID: {doc.doc_id}, Text: {doc.text[:70]}...\")\n    if i >= 2: break\n\n# Iterate through queries\nprint(\"\\nFirst 3 queries:\")\nfor i, query in enumerate(dataset.queries_iter()):\n    print(f\"  Query ID: {query.query_id}, Text: {query.text}\")\n    if i >= 2: break\n\n# Access relevance judgments (qrels)\nprint(\"\\nFirst 3 qrels:\")\nfor i, qrel in enumerate(dataset.qrels_iter()):\n    print(f\"  Query ID: {qrel.query_id}, Doc ID: {qrel.doc_id}, Relevance: {qrel.relevance}\")\n    if i >= 2: break","lang":"python","description":"This quickstart demonstrates how to load a dataset, such as MS MARCO, and iterate through its documents, queries, and relevance judgments using the `ir_datasets.load()` function and its iterator methods."},"warnings":[{"fix":"Prefer `dataset.docs_iter()`, `dataset.queries_iter()`, and `dataset.qrels_iter()` for memory-efficient processing of large datasets. Only use dictionary-based access like `dataset.qrels_dict()` when you are certain the data will fit in memory.","message":"Converting iterators to dictionaries (e.g., `dataset.qrels_dict()`) loads the entire dataset's relevance judgments into memory. For very large datasets, this can lead to high memory consumption and potential crashes. Use iterators (`dataset.qrels_iter()`) for memory efficiency with large collections.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For stable code, rely on the official Python API methods like `dataset.docs_iter()` as described in the main documentation. If using the Beta API, be aware of its experimental nature and potential for breaking changes.","message":"ir_datasets includes a 'Beta Python API' which offers alternative access patterns (e.g., `dataset.docs` as an iterable object with slicing). This API is experimental, may contain bugs, and is subject to breaking changes in future versions.","severity":"gotcha","affected_versions":"Versions 0.5.0 and later (since beta API introduction)"},{"fix":"Always check the specific dataset's documentation or the output from `ir_datasets` when attempting to load a new collection. Be prepared to follow manual instructions for non-public datasets.","message":"Some datasets are not publicly available and require manual steps (e.g., data usage agreements, local file paths) to access. `ir_datasets` will provide instructions on how to obtain these datasets, but it cannot automate their download in all cases.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z"}