datasketch
datasketch is a Python library that provides probabilistic data structures for efficient similarity search and approximate nearest neighbor (ANN) computations on very large datasets. The current release is 1.9.0, and the project is actively maintained with regular feature, bug-fix, and dependency updates.
Warnings
- gotcha MinHashLSH provides *approximate* nearest neighbors, not exact. The `threshold` parameter guides the search but does not guarantee that all pairs above the threshold will be returned, nor that results strictly adhere to the threshold. It retrieves *candidates* that are likely to be similar.
- gotcha Using `RedisMinHashLSH` or its asynchronous counterpart requires installing the `redis` and/or `aioredis` packages separately (e.g., `pip install datasketch[redis]`). Attempting to use these classes without the required backend will result in an `ImportError`.
- gotcha The `num_perm` parameter (number of permutations) used to initialize `MinHash` objects and `MinHashLSH` indexes must be consistent. Mismatched `num_perm` values will lead to incorrect similarity estimations or errors when querying the LSH structure.
Install
- pip install datasketch
- pip install datasketch[redis]
- pip install datasketch[hnsw]
Imports
- MinHash
from datasketch import MinHash
- MinHashLSH
from datasketch import MinHashLSH
- MinHashLSHForest
from datasketch import MinHashLSHForest
- WeightedMinHashGenerator
from datasketch import WeightedMinHashGenerator
- bBitMinHash
from datasketch import bBitMinHash
- MinHashLSHDeletionSession
from datasketch import MinHashLSHDeletionSession
Quickstart
from datasketch import MinHash, MinHashLSH
# Create MinHash objects for three sets
s1 = {"minhash", "is", "a", "probabilistic", "data", "structure", "for", "estimating", "similarity", "between", "sets"}
s2 = {"minhash", "is", "a", "data", "structure", "for", "estimating", "similarity", "between", "documents"}
s3 = {"today", "is", "a", "beautiful", "day"}
m1 = MinHash(num_perm=128)
m2 = MinHash(num_perm=128)
m3 = MinHash(num_perm=128)
for d in s1:
    m1.update(d.encode('utf8'))
for d in s2:
    m2.update(d.encode('utf8'))
for d in s3:
    m3.update(d.encode('utf8'))
# Create an LSH index with a threshold
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m1", m1)
lsh.insert("m2", m2)
lsh.insert("m3", m3)
# Query the LSH for candidates similar to m2
print(f"Candidate keys for m2: {lsh.query(m2)}")