datasketch
datasketch is a Python library of probabilistic data structures for similarity estimation and approximate nearest neighbor (ANN) search over very large datasets. This page covers version 1.9.0. The project is actively maintained, with releases that add features, fix bugs, and update dependencies.
Common errors
- ValueError: If the two MinHashes have different numbers of permutation functions or different seeds.
cause This error occurs when computing Jaccard similarity between, or merging, two `MinHash` (or `WeightedMinHash`) objects that were initialized with different `num_perm` (number of permutation functions) or `seed` values. Both values are critical to the internal state of a signature, so signatures built with different values cannot be compared.
fix Create all `MinHash` or `WeightedMinHash` objects intended for comparison or merging with the same `num_perm` and `seed` parameters. For example:
```python
from datasketch import MinHash

m1 = MinHash(num_perm=128, seed=1)
m2 = MinHash(num_perm=128, seed=1)
# Update m1 and m2
m1.jaccard(m2)  # This will now work
```

- ModuleNotFoundError: No module named 'datasketch'
cause The `datasketch` library is not installed in the Python environment being used, or it has been confused with a different library named `datasketches` (Apache DataSketches).
fix Install the correct library using pip: `pip install datasketch`. If you intended to use the Apache DataSketches library, install it with `pip install datasketches` (note the 'es' at the end) and adjust your imports accordingly.

- ValueError: The num_perm of MinHash out of range
cause This error occurs when adding a `MinHash` object to a `MinHashLSHForest` if the `num_perm` of the `MinHash` is less than `self.k * self.l` (where `k` and `l` are internal parameters derived from the `num_perm` and `l` provided during `MinHashLSHForest` initialization). In other words, the MinHash has too few permutations for the LSH Forest's configuration.
fix Create the `MinHash` objects with at least as many permutations as the `num_perm` the `MinHashLSHForest` was initialized with. Using the same value for both is the simplest way to guarantee this.
```python
from datasketch import MinHash, MinHashLSHForest

# num_perm for MinHashLSHForest and MinHash must be consistent
num_perms = 128
forest = MinHashLSHForest(num_perm=num_perms)
m = MinHash(num_perm=num_perms)
# ... populate m ...
forest.add("key", m)
forest.index()
```

- TypeError: prepickle=False requires bytes keys for non-dict storage, got <type_name>. Either pass bytes keys or use prepickle=True for automatic serialization.
cause When `MinHashLSH` is used with non-dict storage (e.g., Redis) and `prepickle` is set to `False`, `datasketch` expects keys to be bytes. This error indicates that the provided key is not a bytes object.
fix Either convert your keys to bytes explicitly (e.g., `key.encode('utf-8')`) before adding them to `MinHashLSH`, or set `prepickle=True` in the `MinHashLSH` constructor to enable automatic serialization of keys.
```python
from datasketch import MinHash, MinHashLSH

# Shown with the default in-memory (dict) storage for brevity;
# the error itself applies to non-dict storage such as Redis.

# Option 1: Convert keys to bytes manually
# (prepickle=False is also the default when omitted with dict storage)
lsh = MinHashLSH(threshold=0.5, num_perm=128, prepickle=False)
lsh.insert(b"my_key", MinHash())

# Option 2: Enable automatic pickling (default for storage_config types other than dict)
lsh = MinHashLSH(threshold=0.5, num_perm=128, prepickle=True)
lsh.insert("my_key", MinHash())
```

- MinHashLSHForest not returning results / no result for top-k
cause After adding `MinHash` objects to a `MinHashLSHForest` instance, the `index()` method *must* be called to build the internal data structures that make the keys searchable. Without that call, queries return no results.
fix Always call the `.index()` method on your `MinHashLSHForest` instance after adding all your `MinHash` objects and before performing any queries.
```python
from datasketch import MinHash, MinHashLSHForest

lshf = MinHashLSHForest(num_perm=128)
# Add MinHash objects
lshf.add("set1", MinHash())
lshf.add("set2", MinHash())
# IMPORTANT: Call index() after adding all items
lshf.index()
# Now perform queries
results = lshf.query(MinHash(), k=1)
```
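The first error above exists because a MinHash signature is only meaningful relative to the hash functions that produced it. The following is a toy, standard-library-only sketch of that idea, not datasketch's implementation: `toy_signature` and `toy_jaccard` are hypothetical helpers, and the hash family is an assumption chosen for brevity.

```python
import hashlib
import random

_PRIME = (1 << 61) - 1  # modulus for the toy hash family (an assumption)


def toy_signature(items, num_perm, seed):
    """Return a toy MinHash signature: one minimum per seeded hash function."""
    rng = random.Random(seed)  # the seed fully determines the hash family
    params = [(rng.randrange(1, _PRIME), rng.randrange(_PRIME))
              for _ in range(num_perm)]
    base = [int.from_bytes(hashlib.sha1(x.encode("utf8")).digest()[:8], "big")
            for x in items]
    return [min((a * h + b) % _PRIME for h in base) for a, b in params]


def toy_jaccard(sig1, sig2):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig1) != len(sig2):
        raise ValueError("signatures have different lengths")
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


words = {"minhash", "is", "a", "data", "structure"}
same_seed = toy_jaccard(toy_signature(words, 128, seed=1),
                        toy_signature(words, 128, seed=1))
other_seed = toy_jaccard(toy_signature(words, 128, seed=1),
                         toy_signature(words, 128, seed=2))
print(same_seed, other_seed)
```

In this toy model, identical input compared under the same seed agrees at every position, while the same input hashed under a different seed yields an unrelated signature. A comparison across seeds or signature lengths would therefore produce a meaningless estimate, which is why raising an error is the safer behavior.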
Warnings
- gotcha MinHashLSH provides *approximate* nearest neighbors, not exact. The `threshold` parameter guides the search but does not guarantee that all pairs above the threshold will be returned, nor that results strictly adhere to the threshold. It retrieves *candidates* that are likely to be similar.
- gotcha Using `RedisMinHashLSH` or its asynchronous counterpart requires installing the `redis` and/or `aioredis` packages separately (e.g., `pip install datasketch[redis]`). Attempting to use these classes without the required backend will result in an `ImportError`.
- gotcha The `num_perm` parameter (number of permutations) used to initialize `MinHash` objects and `MinHashLSH` indexes must be consistent. Mismatched `num_perm` values will lead to incorrect similarity estimations or errors when querying the LSH structure.
- gotcha The `hnsw` extra (for HNSW index support) is not available in datasketch version 1.9.0. Running `pip install datasketch[hnsw]` makes pip warn that the extra is not provided. The installation still proceeds, but no additional dependencies for that extra are installed. The same applies to any other unsupported extra.
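To see why the `threshold` is not a hard cutoff, it helps to look at the standard banding analysis for MinHash LSH: with `b` bands of `r` rows each, a pair with Jaccard similarity `s` becomes a candidate with probability `1 - (1 - s**r)**b`. The sketch below evaluates that formula using only the standard library. The values `b=32` and `r=4` are illustrative assumptions, not the parameters datasketch selects for a given threshold.

```python
def candidate_probability(s, b, r):
    """Probability that two sets with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


# Hypothetical banding for a 128-permutation index: 32 bands x 4 rows.
BANDS, ROWS = 32, 4

for s in (0.2, 0.3, 0.5, 0.7, 0.9):
    p = candidate_probability(s, BANDS, ROWS)
    print(f"similarity={s:.1f} -> candidate probability={p:.3f}")
```

The curve rises steeply but smoothly, so some pairs below the threshold are returned and some above it are missed. Re-checking each candidate with `MinHash.jaccard` is a common way to enforce a strict cutoff on the returned keys.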
Install
- pip install datasketch
- pip install datasketch[redis]
- pip install datasketch[hnsw] (not provided as an extra in 1.9.0; pip warns and installs only the base package)
Imports
- MinHash
from datasketch import MinHash
- MinHashLSH
from datasketch import MinHashLSH
- MinHashLSHForest
from datasketch import MinHashLSHForest
- WeightedMinHashGenerator
from datasketch import WeightedMinHashGenerator
- bBitMinHash
from datasketch import bBitMinHash
- MinHashLSHDeletionSession
from datasketch.lsh import MinHashLSHDeletionSession
from datasketch import MinHashLSHDeletionSession
Quickstart
from datasketch import MinHash, MinHashLSH
# Create MinHash objects for two sets
s1 = {"minhash", "is", "a", "probabilistic", "data", "structure", "for", "estimating", "similarity", "between", "sets"}
s2 = {"minhash", "is", "a", "data", "structure", "for", "estimating", "similarity", "between", "documents"}
s3 = {"today", "is", "a", "beautiful", "day"}
m1 = MinHash(num_perm=128)
m2 = MinHash(num_perm=128)
m3 = MinHash(num_perm=128)
for d in s1:
    m1.update(d.encode('utf8'))
for d in s2:
    m2.update(d.encode('utf8'))
for d in s3:
    m3.update(d.encode('utf8'))
# Create an LSH index with a threshold
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m1", m1)
lsh.insert("m2", m2)
lsh.insert("m3", m3)
# Query the LSH for candidates similar to m2
print(f"Candidate keys for m2: {lsh.query(m2)}")