bm25s

0.3.3 · active · verified Sat Apr 11

bm25s (BM25-Sparse) is a fast, pure-Python implementation of the BM25 lexical search algorithm that relies primarily on NumPy for its sparse matrix operations. It aims for high performance with few dependencies and is reported to be significantly faster than other Python BM25 implementations. The library is actively developed: version 0.3.3 is the latest release, and updates regularly add features and performance improvements.
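
For readers unfamiliar with BM25 itself, the following is a minimal standard-library sketch of one common form of the scoring formula, using a smoothed, always-positive IDF. It is not bm25s's implementation, which precomputes scores in sparse matrices. The function name `bm25_scores` and the values `k1=1.5` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query (illustrative only)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            n_t = doc_freq[term]
            # Smoothed IDF: rarer terms weigh more, and the value stays positive.
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            length_norm = 1.0 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


# Hand-tokenized toy corpus (hypothetical tokens, no stemming applied).
corpus_tokens = [
    ["cat", "feline", "likes", "purr"],
    ["dog", "human", "best", "friend", "loves", "play"],
    ["bird", "beautiful", "animal", "fly"],
    ["fish", "creature", "lives", "water", "swims"],
]
scores = bm25_scores(["fish", "purr", "cat"], corpus_tokens)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Documents that share no term with the query score zero, and a document matching two query terms outranks one matching a single term of equal rarity.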

Quickstart

This quickstart shows how to tokenize a corpus with the `tokenize` utility (with optional stemming and stopword removal), index the documents with the core `BM25` class, and run a query to retrieve the top-k most relevant documents.

import bm25s
import Stemmer  # Optional dependency: pip install PyStemmer

corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
    "a fish is a creature that lives in water and swims",
]

# Optional: create a stemmer, then tokenize the corpus with stemming and English stopword removal.
# For optimal results, ensure PyStemmer is installed or provide your own tokenizer.
stemmer = Stemmer.Stemmer("english")
tokenized_corpus = bm25s.tokenize(corpus, stemmer=stemmer, stopwords="english")

# Create the BM25 model and index the corpus.
# No corpus is attached here, so retrieve() returns document indices.
retriever = bm25s.BM25()
retriever.index(tokenized_corpus)

# Query the corpus and get top-k results.
# Both arrays have shape (n_queries, k); there is a single query here.
query = "does the fish purr like a cat?"
tokenized_query = bm25s.tokenize(query, stemmer=stemmer, stopwords="english")
results, scores = retriever.retrieve(tokenized_query, k=2)

# Print the ranked results for the single query (row 0)
print(f"Query: '{query}'")
print("Top results:")
for rank in range(results.shape[1]):
    doc_id = results[0, rank]
    score = scores[0, rank]
    print(f"  {rank + 1}. (score: {score:.2f}): {corpus[doc_id]}")
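
The top-k step can be pictured on its own, apart from the library. The sketch below uses only the standard library and is not how bm25s selects results internally. The helper `top_k` and the score values are hypothetical and exist only to illustrate what "top-k" means for one query.

```python
import heapq


def top_k(scores, k):
    """Return (doc_ids, top_scores) for the k highest scores, best first."""
    ranked = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    doc_ids = [doc_id for doc_id, _ in ranked]
    top_scores = [score for _, score in ranked]
    return doc_ids, top_scores


# Made-up per-document scores for a four-document corpus.
example_scores = [2.59, 0.0, 0.0, 1.18]
doc_ids, top_scores = top_k(example_scores, k=2)
print(doc_ids, top_scores)
```

Using `heapq.nlargest` avoids sorting the full score list, which matters when the corpus is much larger than k.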
