ColBERT AI
ColBERT (Contextualized Late Interaction over BERT) is a neural information retrieval model for efficient and effective passage search over large text collections. It encodes queries and passages into token-level contextualized embeddings and compares them through fine-grained late interaction. The library is currently at version 0.2.22 and receives regular updates focused on performance, bug fixes, and broader compatibility.
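To make "late interaction" concrete, here is a minimal sketch of the MaxSim scoring idea: each query token embedding is matched to its most similar passage token embedding, and those maxima are summed. This is an illustration in pure Python with made-up 2-dimensional vectors, not the library's implementation; the names `dot` and `maxsim_score` are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs: List[Vector], doc_embs: List[Vector]) -> float:
    """Sum, over query tokens, of the best-matching passage token similarity."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy embeddings, invented for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a close match for each query token
doc_b = [[0.5, 0.5]]                          # only a partial match for each query token

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Because the score is computed per token and only combined at the end, passage embeddings can be computed and indexed offline, which is what the indexing step in the Quickstart does.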
Common errors
- ImportError: cannot import name 'AdamW' from 'transformers'
  cause: The `AdamW` optimizer was removed from the `transformers` library in newer versions (e.g., v4.36+). Older `colbert-ai` versions that import it directly will fail.
  fix: Upgrade `colbert-ai` to v0.2.22 or later. Alternatively, downgrade your `transformers` library: `pip install transformers==4.35.2`.
- ninja: build stopped: subcommand failed. Clustering X points in YD to Z clusters... /usr/local/cuda-X.Y/bin/nvcc: not found
  cause: The CUDA compiler (`nvcc`) is missing or is not correctly configured in your PATH. It is required for building FAISS extensions or other C++ components during indexing.
  fix: Ensure your CUDA toolkit is properly installed and that `nvcc` is accessible via your system's PATH environment variable. Verify that `CUDA_HOME` is set correctly, if applicable.
- AssertionError: /path/to/existing/index
  cause: When running indexing, if a directory with the specified index name already exists and is not empty, `colbert-ai` raises an `AssertionError` to prevent accidental overwrites.
  fix: Delete the existing index directory, whose path is shown in the error message (e.g., `rm -rf experiments/default/indexes/my_simple_index` with the Quickstart settings), choose a new, unique index name, or pass `overwrite=True` to `indexer.index(...)`.
- ValueError: Invalid pattern: '**' can only be an entire path component
  cause: This error, often originating from `colbert/indexing/loaders.py`, was caused by an incorrect regex flag when processing file paths or collection data in older versions.
  fix: Upgrade `colbert-ai` to version 0.2.22 or newer: `pip install --upgrade colbert-ai`. This version includes a fix for the regex handling.
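Two of the errors above (missing `nvcc`, pre-existing index directory) can be caught before indexing starts. The following is a minimal sketch using only the standard library. The helper names (`find_nvcc`, `index_dir`, `preflight`) are hypothetical, and the `<root>/<experiment>/indexes/<name>` layout is an assumption about where the index is written; the path printed in the `AssertionError` is the authoritative one.

```python
import os
import shutil
from typing import List, Optional


def find_nvcc() -> Optional[str]:
    """Return the path to nvcc from PATH or CUDA_HOME/bin, or None if not found."""
    on_path = shutil.which("nvcc")
    if on_path:
        return on_path
    cuda_home = os.environ.get("CUDA_HOME")
    if cuda_home:
        candidate = os.path.join(cuda_home, "bin", "nvcc")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def index_dir(root: str, experiment: str, name: str) -> str:
    """Assumed index location: <root>/<experiment>/indexes/<name>."""
    return os.path.join(root, experiment, "indexes", name)


def preflight(root: str, experiment: str, name: str) -> List[str]:
    """Collect human-readable problems that would likely break indexing."""
    problems = []
    if find_nvcc() is None:
        problems.append("nvcc not found on PATH or under CUDA_HOME/bin")
    path = index_dir(root, experiment, name)
    if os.path.isdir(path) and os.listdir(path):
        problems.append(f"index directory already exists and is not empty: {path}")
    return problems


if __name__ == "__main__":
    for problem in preflight("experiments", "default", "my_simple_index"):
        print("WARNING:", problem)
```

An empty list from `preflight` does not guarantee that indexing will succeed; it only rules out these two specific failure modes.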
Warnings
- breaking The `AdamW` optimizer was removed from the `transformers` library in recent versions (e.g., v4.36+). Older versions of `colbert-ai` (prior to 0.2.22) that import `AdamW` directly from `transformers` will break.
- gotcha Installing PyTorch and FAISS (especially `faiss-gpu`) via `pip` can lead to stability issues or incorrect CUDA configurations. The official ColBERT documentation recommends installing these specific dependencies with `conda`.
- gotcha Indexing large collections can be memory and compute intensive, particularly without GPU acceleration. Failures can occur if CUDA is not correctly configured or if system resources are exhausted during the indexing process.
- bug A bug in `colbert/indexing/loaders.py` related to regex handling could cause indexing failures with certain collection inputs in versions before 0.2.22.
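Since the `AdamW` import and `loaders.py` issues are both tied to versions before 0.2.22, a quick version check can tell you whether an upgrade is needed. This is a sketch using `importlib.metadata` from the standard library; `parse_version`, `needs_upgrade`, and `installed_version` are hypothetical helpers, and the parser deliberately handles only simple dotted versions (suffixes such as `.post1` or `rc1` are ignored).

```python
from importlib import metadata
from typing import Optional, Tuple

# Version named in this document as containing the fixes.
FIXED_IN = (0, 2, 22)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version such as '0.2.22'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def needs_upgrade(version: str, fixed_in: Tuple[int, ...] = FIXED_IN) -> bool:
    """True if `version` is older than the release that contains the fixes."""
    return parse_version(version) < fixed_in


def installed_version(dist: str = "colbert-ai") -> Optional[str]:
    """Return the installed version of `dist`, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    current = installed_version()
    if current is None:
        print("colbert-ai is not installed")
    elif needs_upgrade(current):
        print(f"colbert-ai {current} is older than 0.2.22; consider upgrading")
    else:
        print(f"colbert-ai {current} is recent enough")
```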
Install
- pip install colbert-ai
- pip install "colbert-ai[torch,faiss-gpu]"
Imports
- ColBERTConfig
from colbert.infra import ColBERTConfig
- RunConfig
from colbert.infra import RunConfig
- Run
from colbert.infra import Run
- Indexer
from colbert import Indexer
- Searcher
from colbert import Searcher
- Trainer
from colbert import Trainer
Quickstart
import os
from colbert.infra import ColBERTConfig, RunConfig, Run
from colbert import Indexer, Searcher

# Basic setup for running ColBERT.
# For real use, ensure a checkpoint exists or is downloaded.
# For example, download the colbertv2.0 checkpoint via
# 'wget https://huggingface.co/colbert-ir/colbertv2.0/resolve/main/colbertv2.0.tar.gz'

# A dummy collection and queries for demonstration
collection = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is a rapidly evolving field.",
    "Python is a popular programming language for AI and machine learning.",
    "Machine learning is a subset of artificial intelligence.",
]
queries = ["What is AI?", "Python programming"]

# Configure ColBERT
# Replace 'colbert-ir/colbertv2.0' with a local path if downloaded
COLBERT_CHECKPOINT = os.environ.get('COLBERT_CHECKPOINT', 'colbert-ir/colbertv2.0')
INDEX_ROOT = os.environ.get('COLBERT_INDEX_ROOT', 'experiments')
INDEX_NAME = os.environ.get('COLBERT_INDEX_NAME', 'my_simple_index')

# ColBERT launches worker processes, so keep the entry point guarded.
if __name__ == '__main__':
    # The experiment root is set on RunConfig; the index is written beneath it.
    with Run().context(RunConfig(nranks=1, experiment='default', root=INDEX_ROOT)):
        config = ColBERTConfig(checkpoint=COLBERT_CHECKPOINT)

        # 1. Indexing
        indexer = Indexer(checkpoint=COLBERT_CHECKPOINT, config=config)
        indexer.index(name=INDEX_NAME, collection=collection)

        # 2. Searching: search() returns parallel sequences (passage IDs, ranks, scores)
        searcher = Searcher(index=INDEX_NAME, config=config, collection=collection)
        for query in queries:
            print(f"\nSearching with query: '{query}'")
            results = searcher.search(query, k=3)
            for passage_id, rank, score in zip(*results):
                print(f"Passage ID: {passage_id}, Rank: {rank}, Score: {score:.2f}, Text: {collection[passage_id]}")
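The Quickstart unpacks search results with `zip(*results)`, which implies three parallel sequences: passage IDs, ranks, and scores. As a small sketch under that assumption, the helper below turns such a tuple into printable lines. `format_results` is a hypothetical name, and the result tuple used here is made up so the example runs without a checkpoint or an index.

```python
from typing import List, Sequence, Tuple

# Assumed shape of a search result: (passage_ids, ranks, scores).
SearchResult = Tuple[Sequence[int], Sequence[int], Sequence[float]]


def format_results(results: SearchResult, collection: Sequence[str]) -> List[str]:
    """Render one line per hit: rank, passage ID, score, and passage text."""
    passage_ids, ranks, scores = results
    return [
        f"{rank}. [{pid}] {score:.2f}  {collection[pid]}"
        for pid, rank, score in zip(passage_ids, ranks, scores)
    ]


collection = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is a rapidly evolving field.",
    "Python is a popular programming language for AI and machine learning.",
    "Machine learning is a subset of artificial intelligence.",
]

# Invented result tuple for illustration; real scores come from searcher.search().
fake_results = ([3, 1], [1, 2], [21.5, 18.25])

for line in format_results(fake_results, collection):
    print(line)
```

With a real searcher, the same helper could be called as `format_results(searcher.search(query, k=3), collection)`.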