{"id":1301,"library":"datasketch","title":"datasketch","description":"datasketch is a Python library of probabilistic data structures for similarity search and approximate nearest neighbor (ANN) search on very large datasets. The current version is 1.9.0, and the project is actively maintained.","status":"active","version":"1.9.0","language":"en","source_language":"en","source_url":"https://github.com/ekzhu/datasketch","tags":["minhashing","lsh","probabilistic-data-structures","similarity","data-deduplication","approximate-nearest-neighbor","ann"],"install":[{"cmd":"pip install datasketch","lang":"bash","label":"Standard Installation"},{"cmd":"pip install datasketch[redis]","lang":"bash","label":"With Redis support (for RedisMinHashLSH)"},{"cmd":"pip install datasketch[hnsw]","lang":"bash","label":"With HNSW support (for HNSWIndex)"}],"dependencies":[{"reason":"Core dependency for numerical operations within probabilistic data structures.","package":"numpy","optional":false},{"reason":"Required for RedisMinHashLSH, which stores MinHash sketches in a Redis server for distributed LSH. Not required for in-memory LSH.","package":"redis","optional":true},{"reason":"Required for asynchronous RedisMinHashLSH functionality, introduced in v1.9.0, providing async Redis integration. 
Not required for synchronous or in-memory LSH.","package":"aioredis","optional":true},{"reason":"Required for HNSWIndex for Approximate Nearest Neighbor search.","package":"hnswlib","optional":true}],"imports":[{"symbol":"MinHash","correct":"from datasketch import MinHash"},{"symbol":"MinHashLSH","correct":"from datasketch import MinHashLSH"},{"symbol":"MinHashLSHForest","correct":"from datasketch import MinHashLSHForest"},{"symbol":"WeightedMinHashGenerator","correct":"from datasketch import WeightedMinHashGenerator"},{"symbol":"bBitMinHash","correct":"from datasketch import bBitMinHash"},{"note":"Introduced in v1.8.0, often imported directly from the top-level package.","wrong":"from datasketch.lsh import MinHashLSHDeletionSession","symbol":"MinHashLSHDeletionSession","correct":"from datasketch import MinHashLSHDeletionSession"}],"quickstart":{"code":"from datasketch import MinHash, MinHashLSH\n\n# Three example sets of tokens\ns1 = {\"minhash\", \"is\", \"a\", \"probabilistic\", \"data\", \"structure\", \"for\", \"estimating\", \"similarity\", \"between\", \"sets\"}\ns2 = {\"minhash\", \"is\", \"a\", \"data\", \"structure\", \"for\", \"estimating\", \"similarity\", \"between\", \"documents\"}\ns3 = {\"today\", \"is\", \"a\", \"beautiful\", \"day\"}\n\n# One MinHash per set; num_perm must match the LSH index below\nm1 = MinHash(num_perm=128)\nm2 = MinHash(num_perm=128)\nm3 = MinHash(num_perm=128)\n\n# MinHash.update() expects bytes, so encode each token\nfor d in s1:\n    m1.update(d.encode('utf8'))\nfor d in s2:\n    m2.update(d.encode('utf8'))\nfor d in s3:\n    m3.update(d.encode('utf8'))\n\n# Create an LSH index with a Jaccard similarity threshold\nlsh = MinHashLSH(threshold=0.5, num_perm=128)\nlsh.insert(\"m1\", m1)\nlsh.insert(\"m2\", m2)\nlsh.insert(\"m3\", m3)\n\n# Query the LSH index for candidate keys similar to m2\nprint(f\"Candidate keys for m2: {lsh.query(m2)}\")","lang":"python","description":"This quickstart demonstrates the core functionality of datasketch: creating MinHash objects from sets and using MinHashLSH to find approximate nearest neighbors. 
The example initializes three MinHash objects from different text sets, inserts them into an LSH index, and then queries the index to find items similar to `m2`."},"warnings":[{"fix":"Treat `query()` results as candidates and re-score them if precision matters. `MinHash.jaccard()` returns an *estimate* of the Jaccard similarity between two MinHash objects, not the exact value; an exact value requires a brute-force comparison of the original sets.","message":"MinHashLSH provides *approximate* nearest neighbors, not exact. The `threshold` parameter guides the search but does not guarantee that all pairs above the threshold will be returned, nor that every returned key is above the threshold. It retrieves *candidates* that are likely to be similar.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install `datasketch` with the `[redis]` extra (`pip install datasketch[redis]`) or explicitly install the `redis` and/or `aioredis` packages.","message":"Using `RedisMinHashLSH` or its asynchronous counterpart requires the `redis` and/or `aioredis` packages, which are not installed by default (e.g., use `pip install datasketch[redis]`). Attempting to use these classes without the required backend will result in an `ImportError`.","severity":"gotcha","affected_versions":"All versions with Redis support (v1.5.0+ for `redis`, v1.9.0+ for `aioredis`)"},{"fix":"Ensure `num_perm` is the same across all `MinHash` objects intended for comparison and for the `MinHashLSH` or `MinHashLSHForest` index they are inserted into.","message":"The `num_perm` parameter (number of permutations) used to initialize `MinHash` objects and `MinHashLSH` indexes must be consistent. With mismatched `num_perm` values, `MinHash.jaccard()` and the index's `insert()` and `query()` methods raise a `ValueError`.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-08T00:00:00.000Z","next_check":"2026-07-07T00:00:00.000Z"}