Sparse Dot Top-N
sparse-dot-topn is a Python package that accelerates sparse matrix multiplication followed by selection of the top-N results. It significantly reduces the memory footprint and improves performance for operations common in tasks such as large-scale string comparison and entity matching. Developed by ING Wholesale Banking Advanced Analytics, it is currently at version 1.2.0 and receives regular updates focused on performance and Python version compatibility.
Common errors
-
cannot import name 'awesome_cossim_topn' from 'sparse_dot_topn'
cause: The function `awesome_cossim_topn` was deprecated and replaced by `sp_matmul_topn` in version 1.0.0. It has likely been removed in newer versions.
fix: Update your import statement and function calls to use `sp_matmul_topn` instead. For example: `from sparse_dot_topn import sp_matmul_topn` and `C = sp_matmul_topn(A, B, top_n=N)`.
-
Could not build wheels for sparse-dot-topn, which is required to install pyproject.toml-based projects
cause: This error typically indicates missing build dependencies (a C++ compiler, Cython, or compatible NumPy/SciPy versions) needed to compile the underlying C++ extension when no pre-built wheel is available for your system/Python version.
fix: Ensure you have a C++17-compatible compiler installed (e.g., `build-essential` on Debian/Ubuntu, Xcode Command Line Tools on macOS, Visual Studio Build Tools on Windows). Also explicitly install `cython` and `numpy` before `sparse-dot-topn`: `pip install cython numpy scipy sparse-dot-topn`. If issues persist, try pinning a known-working NumPy version or installing with `--no-binary sparse_dot_topn`.
-
MemoryError: Unable to allocate ...
cause: Sparse matrix multiplication on extremely large matrices can still exhaust available memory, even with `sparse-dot-topn`'s optimizations.
fix: For very large matrices (e.g., O(10M+) rows), break them into smaller chunks and use the `zip_sp_matmul_topn` function to merge the per-chunk results. Reducing `top_n` or raising `threshold` also shrinks the memory footprint.
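The chunking idea can be illustrated with plain SciPy (no sparse-dot-topn required): split `A` along its rows, prune each partial product to the top-N values per row, and stack the results. The `topn_per_row` helper below is a hypothetical sketch of the pruning step, not the library's implementation:

```python
import numpy as np
import scipy.sparse as sparse

def topn_per_row(C, n):
    """Keep only the n largest values in each row of a CSR matrix."""
    C = C.tocsr()
    rows, cols, vals = [], [], []
    for i in range(C.shape[0]):
        start, end = C.indptr[i], C.indptr[i + 1]
        row_vals = C.data[start:end]
        row_cols = C.indices[start:end]
        if len(row_vals) > n:
            keep = np.argpartition(row_vals, -n)[-n:]  # indices of n largest
            row_vals, row_cols = row_vals[keep], row_cols[keep]
        rows.extend([i] * len(row_vals))
        cols.extend(row_cols)
        vals.extend(row_vals)
    return sparse.csr_matrix((vals, (rows, cols)), shape=C.shape)

A = sparse.random(10_000, 100, density=0.1, format="csr", random_state=1)
B = sparse.random(100, 500, density=0.1, format="csr", random_state=2)

chunks = []
for start in range(0, A.shape[0], 2_000):   # process 2,000 rows at a time
    partial = A[start:start + 2_000] @ B    # partial product for this chunk
    chunks.append(topn_per_row(partial, 10))  # prune before moving on
C = sparse.vstack(chunks).tocsr()
```

Because top-N selection is row-local, stacking the pruned chunks gives the same result as pruning the full product, while only one chunk's unpruned product is in memory at a time. Column-wise chunking of `B` is what requires the zip/merge step the library provides.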
Warnings
- breaking Python 3.8 support was dropped in version 1.2.0. Ensure you are using Python 3.9 or higher.
- breaking Major API changes in v1.0.0: `awesome_cossim_topn` was deprecated (use `sp_matmul_topn`), the `ntop` parameter was renamed to `top_n`, `lower_bound` to `threshold`, and `use_threads`/`n_jobs` were combined into `n_threads`.
- gotcha OpenMP initialization issues, especially on macOS, can lead to crashes or unexpected behavior due to the runtime being initialized twice or incorrect `rpath` settings.
- gotcha Setting `threshold=None` (the default since v1.0.0) enables pre-computation of non-zero entries, which can reduce memory at a mild performance penalty (~10%). If performance is critical and memory is not an issue, consider setting an explicit `threshold` (e.g., `0.0`).
Install
-
pip install sparse-dot-topn
Imports
- sp_matmul_topn
from sparse_dot_topn import sp_matmul_topn
- sp_matmul
from sparse_dot_topn import sp_matmul
- zip_sp_matmul_topn
from sparse_dot_topn import zip_sp_matmul_topn
Quickstart
import scipy.sparse as sparse
from sparse_dot_topn import sp_matmul_topn
import numpy as np
# Create two sample sparse matrices (CSR format is recommended for performance)
A = sparse.random(1000, 100, density=0.1, format="csr", random_state=42)
B = sparse.random(100, 2000, density=0.1, format="csr", random_state=42)
# Compute C = A * B and retain the top 10 values per row in C
# sp_matmul_topn also supports `n_threads` for parallel execution
C = sp_matmul_topn(A, B, top_n=10, n_threads=None, threshold=0.0)
print(f"Shape of A: {A.shape}")
print(f"Shape of B: {B.shape}")
print(f"Shape of result C: {C.shape}")
print(f"Number of non-zero elements in C: {C.nnz}")
# print(C)