bm25s
bm25s (BM25-Sparse) is a fast implementation of the BM25 lexical search algorithm in pure Python, built mainly on NumPy for sparse matrix operations. It prioritizes speed and a small dependency footprint, and is substantially faster than other Python implementations. The library is actively developed: version 0.3.3 is the latest release, and updates regularly add features and performance improvements.
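To make the scoring idea concrete, here is a minimal pure-Python sketch of a Lucene-style BM25 formula. It is an illustration only, not the library's implementation: the function name `bm25_scores`, the whitespace tokenization, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query (hypothetical helper, Lucene-style BM25)."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf / norm
        scores.append(score)
    return scores


docs = [
    "a cat is a feline and likes to purr".split(),
    "a dog is the human's best friend and loves to play".split(),
    "a bird is a beautiful animal that can fly".split(),
    "a fish is a creature that lives in water and swims".split(),
]
scores = bm25_scores("fish purr cat".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Only documents that share at least one term with the query receive a non-zero score, which is why a sparse matrix representation suits this kind of index.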
Warnings
- breaking Starting with version 0.3.0, `scipy` is no longer a required dependency. The library now uses a pure NumPy-based CSC matrix builder by default. If you relied on SciPy's CSC builder, you must install `scipy` separately and pass `csc_backend="scipy"` to the `BM25()` constructor, or install `bm25s` with the `[indexing]` extra (e.g., `pip install bm25s[indexing]`).
- breaking The internal import path for the `selection` module changed in version 0.3.0. If you were directly importing `selection` (e.g., `from bm25s import selection`), you should update your imports to `from bm25s import selection_np`.
- gotcha For simpler, beginner-friendly usage and a high-level API that includes all necessary dependencies, the project now recommends using the separate `bm25` package (i.e., `pip install bm25`). Users should be aware of the distinction if seeking a fully batteries-included experience.
- gotcha Achieving optimal performance and relevance often requires optional dependencies such as `PyStemmer` for stemming, `numba` for JIT compilation, or `jax[cpu]` for accelerated top-k selection. Without these, performance or search quality might not meet expectations, especially on large datasets.
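Before relying on any of the optional speedups above, it can help to check which optional packages are importable in the current environment. The following is a small sketch using only the standard library; the helper name `find_available` and the mapping of pip names to import names are assumptions made for illustration and are not part of bm25s.

```python
from importlib.util import find_spec

# pip package name -> top-level import name (assumed mapping; an extra such as
# "jax[cpu]" is still imported as "jax").
OPTIONAL_DEPENDENCIES = {
    "PyStemmer": "Stemmer",
    "numba": "numba",
    "jax[cpu]": "jax",
    "scipy": "scipy",
}


def find_available(dependencies):
    """Return {pip_name: bool} telling whether each module can be located."""
    return {
        pip_name: find_spec(module_name) is not None
        for pip_name, module_name in dependencies.items()
    }


for pip_name, installed in find_available(OPTIONAL_DEPENDENCIES).items():
    status = "installed" if installed else "missing"
    print(f"{pip_name}: {status}")
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages.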
Install
- pip install bm25s
- pip install bm25s[full]
- pip install bm25s[indexing] # for SciPy's CSC builder if preferred
- pip install jax[cpu] # for JAX-based top-k selection speedup
Imports
- BM25
from bm25s import BM25
- tokenize
from bm25s import tokenize
- selection
from bm25s import selection_np
Quickstart
import bm25s
import Stemmer # Ensure 'pip install PyStemmer' is run for this
corpus = [
"a cat is a feline and likes to purr",
"a dog is the human's best friend and loves to play",
"a bird is a beautiful animal that can fly",
"a fish is a creature that lives in water and swims",
]
# Optional: create a stemmer and tokenizer
# For optimal results, ensure PyStemmer is installed or provide your own tokenizer.
stemmer = Stemmer.Stemmer("english")
tokenized_corpus = bm25s.tokenize(corpus, stemmer=stemmer, stopwords="english")
# Create the BM25 model and index the corpus.
# The corpus is not attached to the retriever here, so retrieve() returns document indices.
retriever = bm25s.BM25()
retriever.index(tokenized_corpus)
# Query the corpus and get top-k results
query = "does the fish purr like a cat?"
tokenized_query = bm25s.tokenize(query, stemmer=stemmer, stopwords="english")
# Both arrays have shape (n_queries, k)
results, scores = retriever.retrieve(tokenized_query, k=2)
# Print the ranked results for the single query (row 0)
print(f"Query: '{query}'")
print("Top results:")
for rank in range(results.shape[1]):
    doc_id = results[0, rank]
    score = scores[0, rank]
    print(f"  Rank {rank + 1} (score: {score:.2f}): {corpus[doc_id]}")