datasketch

1.9.0 · active · verified Wed Apr 08

datasketch is a Python library of probabilistic data structures, such as MinHash and MinHashLSH, for estimating similarity and running approximate nearest neighbor (ANN) searches over very large datasets. The current version is 1.9.0, and the project is actively maintained, with releases that add features, fix bugs, and update dependencies.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates the core functionality of datasketch: creating MinHash objects from sets and using MinHashLSH to find approximate nearest neighbors. The example builds three MinHash objects from different sets of words, inserts them into an LSH index, and then queries the index for keys whose sets are likely to be similar to the one behind `m2`.

from datasketch import MinHash, MinHashLSH

# Define three sets of words to compare
s1 = {"minhash", "is", "a", "probabilistic", "data", "structure", "for", "estimating", "similarity", "between", "sets"}
s2 = {"minhash", "is", "a", "data", "structure", "for", "estimating", "similarity", "between", "documents"}
s3 = {"today", "is", "a", "beautiful", "day"}

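# Create one MinHash per set, each using 128 permutation functions
m1 = MinHash(num_perm=128)
m2 = MinHash(num_perm=128)
m3 = MinHash(num_perm=128)

# Feed every element of each set to its MinHash as UTF-8 bytes
for d in s1:
    m1.update(d.encode('utf8'))
for d in s2:
    m2.update(d.encode('utf8'))
for d in s3:
    m3.update(d.encode('utf8'))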

# Create an LSH index with a Jaccard similarity threshold of 0.5;
# num_perm must match the MinHash objects inserted into it
lsh = MinHashLSH(threshold=0.5, num_perm=128)
lsh.insert("m1", m1)
lsh.insert("m2", m2)
lsh.insert("m3", m3)

# Query the LSH for keys of candidates similar to m2
# (results are approximate and include "m2" itself, since it was inserted)
print(f"Candidate keys for m2: {lsh.query(m2)}")
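
To make the idea behind the quickstart concrete, the following is a minimal standard-library sketch of how a MinHash signature can estimate Jaccard similarity. It is not datasketch's implementation: the helper names `minhash_signature`, `estimate_jaccard`, and `exact_jaccard` are hypothetical, and salted BLAKE2b hashes are assumed here as a stand-in for the random permutations a real MinHash uses.

```python
import hashlib


def minhash_signature(items, num_perm=128):
    """Return num_perm minimum hash values, one per salted hash function.

    Illustrative only: each salt stands in for one random permutation.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimums agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


# The same three sets as in the quickstart
s1 = {"minhash", "is", "a", "probabilistic", "data", "structure", "for",
      "estimating", "similarity", "between", "sets"}
s2 = {"minhash", "is", "a", "data", "structure", "for",
      "estimating", "similarity", "between", "documents"}
s3 = {"today", "is", "a", "beautiful", "day"}

sig1, sig2, sig3 = (minhash_signature(s) for s in (s1, s2, s3))

print("s1 vs s2:", exact_jaccard(s1, s2), estimate_jaccard(sig1, sig2))
print("s2 vs s3:", exact_jaccard(s2, s3), estimate_jaccard(sig2, sig3))
```

The exact Jaccard similarity is 0.75 for `s1` and `s2` (9 shared words out of 12) and 2/13, about 0.15, for `s2` and `s3`. With a threshold of 0.5, the quickstart query would therefore be expected to return `m1` and `m2` but not `m3`, although LSH results are probabilistic and can include false positives or miss true matches.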
