TREC CAR Tools

2.6 · active · verified Tue Apr 14

trec-car-tools is a Python library (version 2.6, released Feb 1, 2022) that supports participants in the TREC Complex Answer Retrieval (CAR) track. It provides functions for reading and manipulating the TREC CAR dataset, which is typically distributed in CBOR format and includes annotations, paragraphs, and outlines. The library's release cadence appears to be tied to major TREC CAR track version releases.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to read TREC CAR annotation and paragraph files using the `iter_annotations` and `iter_paragraphs` functions. It assumes the dataset files are available locally; if a file is missing, the corresponding example is skipped. The code iterates through the first few pages and paragraphs to show their IDs, names, and text content.

import os
from trec_car.read_data import iter_annotations, iter_paragraphs

# Assuming 'train.test200.cbor' and 'train.test200.cbor.paragraphs' are available locally
# You would typically download these from the TREC CAR website

# Example 1: Reading annotations (pages file)
annotations_file = os.environ.get('TREC_CAR_ANNOTATIONS_FILE', 'train.test200.cbor')
if os.path.exists(annotations_file):
    print(f"\nReading page IDs from {annotations_file}:")
    with open(annotations_file, 'rb') as f:
        for i, page in enumerate(iter_annotations(f)):
            # Stop after the first 2 pages for brevity
            if i >= 2:
                break
            # Page objects expose snake_case attributes: page_id and page_name
            print(f"Page ID: {page.page_id}, Page Name: {page.page_name}")
else:
    print(f"\nSkipping annotation reading: {annotations_file} not found.")

# Example 2: Reading paragraphs file
paragraphs_file = os.environ.get('TREC_CAR_PARAGRAPHS_FILE', 'train.test200.cbor.paragraphs')
if os.path.exists(paragraphs_file):
    print(f"\nReading paragraph text from {paragraphs_file}:")
    with open(paragraphs_file, 'rb') as f:
        for i, para in enumerate(iter_paragraphs(f)):
            # Stop after the first 2 paragraphs for brevity
            if i >= 2:
                break
            # Paragraph objects expose para_id and get_text()
            print(f"Paragraph ID: {para.para_id}, Text: {para.get_text()[:100]}...")
else:
    print(f"\nSkipping paragraph reading: {paragraphs_file} not found.")
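Both loops above cap their output at a handful of items. A minimal sketch of that pattern as a reusable helper follows, using `itertools.islice` from the standard library. The names `take` and `fake_paragraphs` are hypothetical and introduced here only for illustration; the stand-in generator replaces `iter_paragraphs` so that the snippet runs without any dataset files, and it is not part of trec-car-tools.

```python
from itertools import islice


def take(iterable, n):
    """Return at most the first n items of any iterable as a list."""
    return list(islice(iterable, n))


def fake_paragraphs():
    """Hypothetical stand-in for iter_paragraphs(f): lazily yields (id, text) tuples."""
    for i in range(1000):
        yield (f"para-{i}", f"text of paragraph {i}")


# Materialize only the first two items; the rest of the stream is never read.
first_two = take(fake_paragraphs(), 2)
for para_id, text in first_two:
    print(f"Paragraph ID: {para_id}, Text: {text[:100]}")
```

With the real reader, the equivalent call would presumably be `take(iter_paragraphs(f), 2)`, made inside the `with open(...)` block so the file is still open while the items are read.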
