TREC CAR Tools
trec-car-tools is a Python library (version 2.6, released Feb 1, 2022) that supports participants in the TREC Complex Answer Retrieval (CAR) track. It provides functions for reading and manipulating TREC CAR data (annotations, paragraphs, and outlines), which is typically distributed in CBOR format. The library's release cadence appears to be tied to major TREC CAR track version releases.
Warnings
- gotcha Data format versions for TREC CAR datasets can change between releases. Ensure you use a version of `trec-car-tools` compatible with your specific dataset version.
- gotcha Anaconda users should install the `cbor` dependency from the `laura-dietz` channel for Python 3.6 to ensure compatibility.
- gotcha The GitHub issue tracker lists several open issues, some of which point to possible data parsing or consistency problems in the tools, such as `flat_headings_list is not flat` or `v2.0 dataset para id in manual qrels not found in paragraphCorpus`.
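Consistency problems of the kind named in the last warning can be checked for before running experiments. The following is a minimal sketch, not part of `trec-car-tools`: it assumes qrels lines in the usual four-column TREC layout (`query_id 0 paragraph_id relevance`), and the helper names and sample IDs are hypothetical.

```python
def qrels_paragraph_ids(lines):
    """Collect paragraph IDs from four-column TREC qrels lines."""
    ids = set()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        ids.add(parts[2])
    return ids


def missing_from_corpus(qrels_ids, corpus_ids):
    """Return the qrels paragraph IDs that do not occur in the corpus."""
    return sorted(set(qrels_ids) - set(corpus_ids))


# Hypothetical sample data, for illustration only.
sample_qrels = [
    "enwiki:Example/Section 0 p1 1",
    "enwiki:Example/Section 0 p3 1",
    "",
]
sample_corpus_ids = {"p1", "p2"}

missing = missing_from_corpus(qrels_paragraph_ids(sample_qrels), sample_corpus_ids)
print(missing)
```

In practice the corpus IDs would be gathered while iterating over the paragraph corpus, and a non-empty result would suggest a mismatch between the qrels and the dataset version.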
Install
- pip install trec-car-tools
- conda install laura-dietz::trec-car-tools
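Because dataset compatibility depends on the installed release, it can be useful to confirm which version ended up in the environment. A minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the helper name `installed_version` is hypothetical, and the distribution name is assumed to match the pip package name above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="trec-car-tools"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("trec-car-tools is not installed in this environment")
else:
    print(f"trec-car-tools version: {found}")
```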
Imports
- iter_annotations
from trec_car.read_data import iter_annotations
- iter_paragraphs
from trec_car.read_data import iter_paragraphs
- Page
from trec_car.read_data import Page
- Paragraph
from trec_car.read_data import Paragraph
Quickstart
import os
from itertools import islice

from trec_car.read_data import iter_annotations, iter_paragraphs

# Assumes 'train.test200.cbor' and 'train.test200.cbor.paragraphs' are available locally.
# These files are typically downloaded from the TREC CAR website.

# Example 1: Reading annotations (pages file)
annotations_file = os.environ.get('TREC_CAR_ANNOTATIONS_FILE', 'train.test200.cbor')
if os.path.exists(annotations_file):
    print(f"\nReading page IDs from {annotations_file}:")
    with open(annotations_file, 'rb') as f:
        # Print the first 2 pages only, for brevity
        for page in islice(iter_annotations(f), 2):
            # Page objects expose snake_case attributes: page_id, page_name
            print(f"Page ID: {page.page_id}, Page Name: {page.page_name}")
else:
    print(f"\nSkipping annotation reading: {annotations_file} not found.")

# Example 2: Reading paragraphs file
paragraphs_file = os.environ.get('TREC_CAR_PARAGRAPHS_FILE', 'train.test200.cbor.paragraphs')
if os.path.exists(paragraphs_file):
    print(f"\nReading paragraph text from {paragraphs_file}:")
    with open(paragraphs_file, 'rb') as f:
        # Print the first 2 paragraphs only, for brevity
        for para in islice(iter_paragraphs(f), 2):
            # Paragraph objects expose para_id and get_text()
            print(f"Paragraph ID: {para.para_id}, Text: {para.get_text()[:100]}...")
else:
    print(f"\nSkipping paragraph reading: {paragraphs_file} not found.")
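Beyond printing, a common next step is to collect paragraph text into a lookup table keyed by paragraph ID. The sketch below is one way to do that, under stated assumptions: it relies only on each item having a `para_id` attribute and a `get_text()` method (as the library's Python `Paragraph` class is understood to provide), the helper name `build_paragraph_index` is hypothetical, and stand-in objects replace a real CBOR file so the example runs without any data.

```python
from itertools import islice
from types import SimpleNamespace


def build_paragraph_index(paragraphs, limit=None):
    """Map para_id -> text for at most `limit` paragraphs (all if None)."""
    index = {}
    for para in islice(paragraphs, limit):
        index[para.para_id] = para.get_text()
    return index


# Stand-in objects mimicking the assumed Paragraph interface.
stub_paragraphs = [
    SimpleNamespace(para_id="p1", get_text=lambda: "first"),
    SimpleNamespace(para_id="p2", get_text=lambda: "second"),
    SimpleNamespace(para_id="p3", get_text=lambda: "third"),
]

index = build_paragraph_index(stub_paragraphs, limit=2)
print(index)
```

With real data, the first argument would be `iter_paragraphs(f)` for a file opened in binary mode. Setting `limit` keeps memory use bounded, since a full paragraph corpus can be very large.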