BM25 Ranking Algorithms

0.2.2 · active · verified Thu Apr 09

rank_bm25 provides several BM25 ranking algorithms (BM25Okapi, BM25L, BM25Plus) that score the documents of a tokenized corpus against a tokenized query. The current version is 0.2.2. Releases appear to be infrequent; the latest update added support for non-iterable corpora.
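
The three variants differ mainly in how a term's frequency within a document is turned into a score contribution. The sketch below follows the published textbook formulas (Okapi BM25, plus BM25L and BM25+ as described by Lv and Zhai) and is not code from the package. The function names and the default values for k1, b, and delta are assumptions chosen for illustration, and the package's own implementation may differ in details such as IDF handling or how zero-frequency terms are treated.

```python
def length_norm(doc_len, avg_doc_len, b=0.75):
    # Penalizes documents longer than average; equals 1.0 at average length.
    return 1.0 - b + b * doc_len / avg_doc_len


def tf_okapi(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    # Okapi BM25: term frequency saturates as tf grows.
    if tf == 0:
        return 0.0
    return tf * (k1 + 1) / (tf + k1 * length_norm(doc_len, avg_doc_len, b))


def tf_bm25l(tf, doc_len, avg_doc_len, k1=1.5, b=0.75, delta=0.5):
    # BM25L: shifts the length-normalized frequency by delta so that
    # very long documents are penalized less harshly.
    if tf == 0:
        return 0.0
    c = tf / length_norm(doc_len, avg_doc_len, b)
    return (k1 + 1) * (c + delta) / (k1 + c + delta)


def tf_bm25plus(tf, doc_len, avg_doc_len, k1=1.5, b=0.75, delta=1.0):
    # BM25+: adds a constant lower bound delta for any matching term.
    if tf == 0:
        return 0.0
    return delta + tf_okapi(tf, doc_len, avg_doc_len, k1, b)


# Compare the three components for one occurrence of a term in a document
# that is ten times longer than average.
for fn in (tf_okapi, tf_bm25l, tf_bm25plus):
    print(fn.__name__, fn(1, doc_len=100, avg_doc_len=10))
```

In each variant the value is multiplied by the term's IDF and summed over the query terms to produce the document score.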

Warnings

Install

Imports

Quickstart

This example demonstrates how to initialize BM25Okapi with a tokenized corpus and then retrieve scores and top-N documents for a given tokenized query.

import re

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there, this is a document.",
    "This document is about BM25.",
    "Hello, how are you today?",
    "BM25 is a ranking algorithm.",
]


def tokenize(text):
    # Lowercase and keep only word characters, so "BM25." and "bm25" match.
    return re.findall(r"\w+", text.lower())


# The library expects a pre-tokenized corpus (a list of token lists).
tokenized_corpus = [tokenize(doc) for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "BM25 ranking algorithm"
tokenized_query = tokenize(query)

# One relevance score per document, in corpus order.
doc_scores = bm25.get_scores(tokenized_query)
print(f"Document scores: {doc_scores}")

# The n highest-scoring documents, returned as the original strings.
top_n = bm25.get_top_n(tokenized_query, corpus, n=2)
print(f"Top 2 documents: {top_n}")
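
To make concrete what a scoring call of this kind computes, here is a minimal, self-contained sketch of textbook Okapi BM25 in plain Python. It is not the package's implementation. The names tokenize and bm25_scores are hypothetical, the k1 and b defaults are assumed, and the IDF uses the non-negative variant log(1 + (N - n + 0.5) / (n + 0.5)), which is an assumption and may not match what the package does.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Hypothetical helper: lowercase and split on non-word characters.
    return re.findall(r"\w+", text.lower())


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Return one textbook Okapi BM25 score per document, in corpus order."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


docs = [
    "Hello there, this is a document.",
    "This document is about BM25.",
    "Hello, how are you today?",
    "BM25 is a ranking algorithm.",
]
scores = bm25_scores(tokenize("BM25 ranking algorithm"), [tokenize(d) for d in docs])

# Rank documents by descending score and show the two best matches.
ranked = sorted(zip(scores, docs), reverse=True)
for score, doc in ranked[:2]:
    print(f"{score:.3f}  {doc}")
```

Documents that share no terms with the query score exactly zero here, and the document matching all three query terms ranks first.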
