{"id":3911,"library":"bm25s","title":"bm25s","description":"bm25s (BM25-Sparse) is an ultra-fast implementation of the BM25 lexical search algorithm in pure Python, primarily leveraging NumPy for sparse matrix operations. It focuses on high performance and low dependency, providing significant speedups over other Python implementations. The library is actively developed, with version 0.3.3 being the latest release, and receives regular updates including new features and performance enhancements.","status":"active","version":"0.3.3","language":"en","source_language":"en","source_url":"https://github.com/xhluca/bm25s","tags":["search","ranking","sparse matrix","information retrieval","bm25","nlp","fast"],"install":[{"cmd":"pip install bm25s","lang":"bash","label":"Base installation"},{"cmd":"pip install bm25s[full]","lang":"bash","label":"Recommended: with all extra dependencies (stemming, Numba, etc.)"},{"cmd":"pip install bm25s[indexing] # for SciPy's CSC builder if preferred\npip install jax[cpu] # for JAX-based top-k selection speedup","lang":"bash","label":"Optional: Specific backends/optimizations"}],"dependencies":[{"reason":"Core dependency for sparse matrix operations.","package":"numpy"},{"reason":"Optional dependency for its CSC matrix builder (was required before 0.3.0).","package":"scipy","optional":true},{"reason":"Optional dependency for efficient stemming, recommended for better search results.","package":"PyStemmer","optional":true},{"reason":"Optional dependency for JIT compilation, providing speedups for certain operations.","package":"numba","optional":true},{"reason":"Optional dependency for accelerated top-k selection.","package":"jax","optional":true},{"reason":"Optional dependency for enhanced Command-Line Interface (CLI) UI.","package":"rich","optional":true}],"imports":[{"symbol":"BM25","correct":"from bm25s import BM25"},{"symbol":"tokenize","correct":"from bm25s import tokenize"},{"note":"As of v0.3.0, `selection` was renamed internally to 
`selection_np` to avoid conflicts and signify its NumPy-based nature. Direct imports should be updated.","wrong":"from bm25s import selection","symbol":"selection","correct":"from bm25s import selection_np"}],"quickstart":{"code":"import bm25s\nimport Stemmer  # requires: pip install PyStemmer\n\ncorpus = [\n    \"a cat is a feline and likes to purr\",\n    \"a dog is the human's best friend and loves to play\",\n    \"a bird is a beautiful animal that can fly\",\n    \"a fish is a creature that lives in water and swims\",\n]\n\n# Optional: create a stemmer, then tokenize the corpus\n# For optimal results, ensure PyStemmer is installed or provide your own tokenizer.\nstemmer = Stemmer.Stemmer(\"english\")\ntokenized_corpus = bm25s.tokenize(corpus, stemmer=stemmer, stopwords=\"english\")\n\n# Create the BM25 model and index the corpus.\n# No corpus is passed to the constructor, so retrieve() returns document indices.\nretriever = bm25s.BM25()\nretriever.index(tokenized_corpus)\n\n# Query the corpus and get top-k results; both arrays have shape (n_queries, k)\nquery = \"does the fish purr like a cat?\"\ntokenized_query = bm25s.tokenize(query, stemmer=stemmer, stopwords=\"english\")\nresults, scores = retriever.retrieve(tokenized_query, k=2)\n\n# Print the ranked results for the single query (row 0)\nprint(f\"Query: '{query}'\")\nprint(\"Top results:\")\nfor rank in range(results.shape[1]):\n    doc_id = results[0, rank]\n    score = scores[0, rank]\n    print(f\"  (score: {score:.2f}): {corpus[doc_id]}\")","lang":"python","description":"This quickstart demonstrates how to initialize BM25S, tokenize a corpus (with optional stemming and stopword removal), index the documents, and perform a query to retrieve the top-k most relevant documents. It showcases the core `BM25` class and `tokenize` utility."},"warnings":[{"fix":"Explicitly install `scipy` and pass `csc_backend=\"scipy\"` to `BM25()`, or install with `pip install bm25s[indexing]`.","message":"Starting with version 0.3.0, `scipy` is no longer a required dependency. The library now uses a pure NumPy-based CSC matrix builder by default. 
If you relied on SciPy's CSC builder, you must install `scipy` separately and pass `csc_backend=\"scipy\"` to the `BM25()` constructor, or install `bm25s` with the `[indexing]` extra (e.g., `pip install bm25s[indexing]`).","severity":"breaking","affected_versions":">=0.3.0"},{"fix":"Update import statements from `from bm25s import selection` to `from bm25s import selection_np`.","message":"The internal import path for the `selection` module changed in version 0.3.0. If you were directly importing `selection` (e.g., `from bm25s import selection`), you should update your imports to `from bm25s import selection_np`.","severity":"breaking","affected_versions":">=0.3.0"},{"fix":"Consider `pip install bm25` for a higher-level API, or explicitly manage `bm25s` dependencies (like `PyStemmer`) for advanced control.","message":"For simpler, beginner-friendly usage and a high-level API that includes all necessary dependencies, the project now recommends using the separate `bm25` package (i.e., `pip install bm25`). Users should be aware of the distinction if seeking a fully batteries-included experience.","severity":"gotcha","affected_versions":"All"},{"fix":"Install recommended optional dependencies like `pip install PyStemmer`, `pip install numba`, or `pip install jax[cpu]` based on your performance and feature needs.","message":"Achieving optimal performance and relevance often requires optional dependencies such as `PyStemmer` for stemming, `numba` for JIT compilation, or `jax[cpu]` for accelerated top-k selection. Without these, performance or search quality might not meet expectations, especially on large datasets.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}