BM25 Ranking Algorithms
rank-bm25 provides three BM25 variants (BM25Okapi, BM25L, BM25Plus) for ranking the documents of a tokenized corpus against a tokenized query. The current version is 0.2.2. Releases appear to be infrequent; the latest one added support for non-iterable corpuses.
Warnings
- gotcha The BM25 algorithms expect a pre-tokenized corpus (a list of lists of strings) and tokenized queries, not raw strings. Each sub-list represents a document's tokens.
- gotcha The PyPI package name for installation is `rank-bm25` (with a hyphen), but the Python module you import into your code is `rank_bm25` (with an underscore).
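- gotcha Prior to version 0.2.2, passing a corpus that is not an in-memory list (e.g., a generator) to a BM25 constructor was not officially supported and could lead to errors or unexpected behavior. Version 0.2.2 added support, but a generator can be consumed only once: if the same documents are needed again later (for example, as the second argument of `get_top_n`), materialize them into a list first.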
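The first and third gotchas can be guarded against before a model is built. The sketch below uses a hypothetical helper, `prepare_corpus` (not part of rank_bm25), that materializes any iterable of documents into a list of token lists and rejects raw strings, which Python would otherwise iterate character by character.

```python
from typing import Iterable, List


def prepare_corpus(docs: Iterable[Iterable[str]]) -> List[List[str]]:
    """Hypothetical helper: return the corpus as a list of token lists."""
    corpus = []
    for i, doc in enumerate(docs):
        # A raw string is iterable too, but it yields characters, not tokens.
        if isinstance(doc, str):
            raise TypeError(f"document {i} is a raw string; tokenize it first")
        tokens = list(doc)
        if not all(isinstance(tok, str) for tok in tokens):
            raise TypeError(f"document {i} contains non-string tokens")
        corpus.append(tokens)
    return corpus


# A generator is consumed after one pass, so materialize it once and reuse the list.
lazy_docs = (line.lower().split() for line in ["BM25 ranks documents", "Hello there"])
tokenized_corpus = prepare_corpus(lazy_docs)
print(tokenized_corpus)  # [['bm25', 'ranks', 'documents'], ['hello', 'there']]
print(list(lazy_docs))   # [] because the generator is already exhausted
```

The resulting list can be passed to a BM25 constructor and reused afterwards, which a bare generator would not allow.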
Install
pip install rank-bm25
Imports
- BM25Okapi
from rank_bm25 import BM25Okapi
- BM25L
from rank_bm25 import BM25L
- BM25Plus
from rank_bm25 import BM25Plus
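Because the distribution name and the module name differ, a quick check can confirm that both resolve in the current environment. The following sketch uses only the standard library (`importlib.metadata` and `importlib.util`). The function name `describe_install` is illustrative, and it reports a missing package instead of raising.

```python
from importlib import metadata, util
from typing import Optional, Tuple


def describe_install() -> Tuple[Optional[str], bool]:
    """Return (installed version or None, whether the module is importable)."""
    try:
        # Distribution metadata is looked up by the PyPI name (hyphen).
        version = metadata.version("rank-bm25")
    except metadata.PackageNotFoundError:
        version = None
    # The importable module uses the underscore spelling.
    importable = util.find_spec("rank_bm25") is not None
    return version, importable


version, importable = describe_install()
print(f"rank-bm25 version: {version}, rank_bm25 importable: {importable}")
```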
Quickstart
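import re

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there, this is a document.",
    "This document is about BM25.",
    "Hello, how are you today?",
    "BM25 is a ranking algorithm.",
]

def tokenize(text):
    # Lowercase and keep only word characters, so "BM25." and "BM25" both become "bm25"
    return re.findall(r"\w+", text.lower())

# Tokenize the corpus (essential step): BM25 expects a list of token lists
tokenized_corpus = [tokenize(doc) for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# Tokenize the query the same way as the corpus
query = "BM25 ranking algorithm"
tokenized_query = tokenize(query)

# One score per document, in corpus order
doc_scores = bm25.get_scores(tokenized_query)
print(f"Document scores: {doc_scores}")

# get_top_n returns the n highest-scoring entries of the list passed as its second argument
top_n = bm25.get_top_n(tokenized_query, corpus, n=2)
print(f"Top 2 documents: {top_n}")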
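`get_top_n` returns the documents themselves. When positions or scores are also needed, they can be derived from the output of `get_scores`. A minimal sketch in plain Python follows; `top_n_indices` is a hypothetical helper, and the scores are made-up illustrative values, not real library output.

```python
from typing import List, Sequence, Tuple


def top_n_indices(scores: Sequence[float], n: int) -> List[Tuple[int, float]]:
    """Hypothetical helper: (index, score) pairs for the n highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, float(scores[i])) for i in order[:n]]


# Illustrative scores only; in practice pass the result of bm25.get_scores(...).
example_scores = [0.0, 0.3, 0.0, 1.2]
print(top_n_indices(example_scores, n=2))  # [(3, 1.2), (1, 0.3)]
```

Ties are kept in corpus order here, which may differ from the order `get_top_n` produces for equal scores.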