ColBERT AI

0.2.22 · active · verified Thu Apr 16

ColBERT (Contextualized Late Interaction over BERT) is a neural retrieval model for passage search over large text collections. It encodes queries and passages into per-token contextualized embeddings and scores them with late interaction: each query token embedding is matched against its most similar passage token embedding, and these maximum similarities are summed into the relevance score. The library is currently at version 0.2.22; updates focus on performance, bug fixes, and broader compatibility.
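
The late interaction mentioned above can be illustrated in a few lines. The following is a minimal sketch of the scoring rule only, assuming token embeddings are already computed and L2-normalized so that a dot product acts as cosine similarity. The names `dot` and `maxsim_score` are hypothetical and are not part of the `colbert` package, which works on compressed indexes rather than plain Python lists.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs: Sequence[Vector], passage_embs: Sequence[Vector]) -> float:
    """Late-interaction score: for every query token, take the similarity of
    its best-matching passage token, then sum those maxima.

    Raises ValueError if passage_embs is empty (max() of an empty sequence).
    """
    return sum(max(dot(q, d) for d in passage_embs) for q in query_embs)


# Toy 2-dimensional embeddings, chosen by hand for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
passage_b = [[1.0, 0.0], [1.0, 0.0]]  # covers only the first query token

print(maxsim_score(query, passage_a))  # 2.0
print(maxsim_score(query, passage_b))  # 1.0
```

Because each query token is scored independently against the passage tokens, passage embeddings can be computed and stored ahead of time, which is what the indexing step in the quickstart prepares.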

Quickstart

This quickstart demonstrates the basic workflow for indexing a small collection of passages and then performing a search using a pre-trained ColBERT model. It covers the `Indexer` for creating a ColBERT index and the `Searcher` for querying that index. Ensure a ColBERT checkpoint is available, either by letting the library download it or by providing a local path.

import os
from colbert.infra import ColBERTConfig, RunConfig, Run
from colbert import Indexer, Searcher

# A ColBERT checkpoint is required: either a model name the library can download,
# or a local path to a checkpoint you fetched yourself.
# For example, download colbertv2.0 checkpoint via 'wget https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/colbertv2.0.tar.gz'

# A dummy collection and query for demonstration
collection = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is a rapidly evolving field.",
    "Python is a popular programming language for AI and machine learning.",
    "Machine learning is a subset of artificial intelligence."
]
queries = ["What is AI?", "Python programming"]

# Configure ColBERT
# Replace 'colbert-ir/colbertv2.0' with a local path if downloaded
COLBERT_CHECKPOINT = os.environ.get('COLBERT_CHECKPOINT', 'colbert-ir/colbertv2.0')
INDEX_ROOT = os.environ.get('COLBERT_INDEX_ROOT', 'experiments')
INDEX_NAME = os.environ.get('COLBERT_INDEX_NAME', 'my_simple_index')


def main():
    # The experiment root is set on RunConfig; Indexer and Searcher take no 'root' argument.
    # Indexes are written under <root>/<experiment>/indexes/<name>.
    with Run().context(RunConfig(nranks=1, root=INDEX_ROOT, experiment='default')):
        config = ColBERTConfig(checkpoint=COLBERT_CHECKPOINT)

        # 1. Indexing
        indexer = Indexer(checkpoint=COLBERT_CHECKPOINT, config=config)
        indexer.index(name=INDEX_NAME, collection=collection)

        # 2. Searching
        searcher = Searcher(index=INDEX_NAME, config=config, collection=collection)

        for query in queries:
            print(f"\nSearching with query: '{query}'")
            # search() returns three parallel lists: passage ids, ranks, and scores
            results = searcher.search(query, k=3)
            for passage_id, rank, score in zip(*results):
                print(f"Passage ID: {passage_id}, Rank: {rank}, Score: {score:.2f}, Text: {collection[passage_id]}")


# Indexing may launch worker processes, so keep the entry point behind a __main__ guard.
if __name__ == '__main__':
    main()
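
For downstream use it can be convenient to turn the search output into one record per hit. The sketch below assumes, as the quickstart's `zip(*results)` implies, that `searcher.search` returns three parallel sequences (passage ids, ranks, scores). The helper name `to_records` and the sample values are hypothetical and only illustrate the shape of the data; they are not real ColBERT output.

```python
from typing import Any, Dict, List, Sequence, Tuple

SearchResults = Tuple[Sequence[int], Sequence[int], Sequence[float]]


def to_records(results: SearchResults, collection: Sequence[str]) -> List[Dict[str, Any]]:
    """Convert (passage_ids, ranks, scores) into a list of per-hit dictionaries."""
    passage_ids, ranks, scores = results
    return [
        {
            "passage_id": int(pid),
            "rank": int(rank),
            "score": float(score),
            "text": collection[int(pid)],  # look the passage text up by its id
        }
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]


# Stand-in values shaped like the assumed search() output (not real scores).
sample_collection = ["first passage", "second passage", "third passage"]
sample_results = ([2, 0], [1, 2], [21.5, 9.25])

for record in to_records(sample_results, sample_collection):
    print(record)
```

Keeping the conversion separate from the search call means the same helper can feed logging, JSON export, or evaluation code without touching the retrieval step.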
