Simhash Python Library

2.1.2 · active · verified Thu Apr 16

The `simhash` library provides a Python implementation of the Simhash algorithm, a locality-sensitive hashing technique for quickly finding near-duplicate documents or comparing the similarity of two texts or data objects. It is useful for tasks such as large-scale content deduplication, spam detection, and content recommendation, because similar inputs produce fingerprints that differ in only a few bits. The current version is 2.1.2; releases are irregular and driven by contributions and bug fixes.
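To make the idea concrete, here is a minimal pure-Python sketch of a Simhash-style fingerprint. It is not the library's implementation: the function names (`tokenize`, `simhash_sketch`, `hamming`), the 4-character sliding window, the SHA-256 feature hash, and the 64-bit width are all assumptions chosen for illustration.

```python
import hashlib
import re


def tokenize(text, width=4):
    """Lowercase the text, drop non-word characters, return sliding n-grams.

    The window width of 4 is an illustrative assumption.
    """
    cleaned = re.sub(r"[^\w]+", "", text.lower())
    return [cleaned[i:i + width] for i in range(max(len(cleaned) - width + 1, 1))]


def simhash_sketch(text, bits=64):
    """Build a `bits`-wide fingerprint by per-bit voting over hashed features."""
    mask = (1 << bits) - 1
    counts = [0] * bits
    for feature in tokenize(text):
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(bits):
            # Each feature votes +1 for a set bit and -1 for a clear bit.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


fp_a = simhash_sketch("The quick brown fox jumps over the lazy dog.")
fp_b = simhash_sketch("A quick brown fox jumps over the sleeping dog.")
print(hamming(fp_a, fp_b))
```

Because every feature contributes one vote per bit, changing a few features usually flips only a few bits of the result, which is what makes the Hamming distance between fingerprints usable as a similarity signal.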

Common errors

Warnings

Install

Imports

Quickstart

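This quickstart demonstrates how to create `Simhash` objects from strings and calculate the Hamming distance between them, that is, the number of bit positions in which the two fingerprints differ. A smaller distance indicates greater similarity. There is no universal cutoff: the right similarity threshold depends on your data and on how aggressively you want to treat items as duplicates.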

from simhash import Simhash

# Create Simhash objects from text
text1 = "The quick brown fox jumps over the lazy dog."
text2 = "A quick brown fox jumps over the sleeping dog."
text3 = "Python is a programming language."

hash1 = Simhash(text1)
hash2 = Simhash(text2)
hash3 = Simhash(text3)

# Calculate Hamming distance between hashes
# A lower distance means higher similarity
print(f"Distance between '{text1[:20]}...' and '{text2[:20]}...': {hash1.distance(hash2)}")
print(f"Distance between '{text1[:20]}...' and '{text3[:20]}...': {hash1.distance(hash3)}")

# You can use a similarity threshold to determine if items are 'duplicates'
similarity_threshold = 3
if hash1.distance(hash2) < similarity_threshold:
    print(f"'{text1[:20]}...' and '{text2[:20]}...' are considered very similar.")
else:
    print(f"'{text1[:20]}...' and '{text2[:20]}...' are not considered very similar.")
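The threshold check above extends naturally from one pair to a whole collection. The following is a minimal sketch that assumes fingerprints are already available as plain integers; the helper `near_duplicate_pairs`, the document IDs, and the 8-bit sample values are hypothetical and chosen only to keep the arithmetic easy to follow. It compares every pair, so it suits small collections only.

```python
from itertools import combinations


def hamming(a, b):
    """Number of bit positions in which two integer fingerprints differ."""
    return bin(a ^ b).count("1")


def near_duplicate_pairs(fingerprints, threshold=3):
    """Return (id_a, id_b, distance) for every pair strictly below `threshold`.

    `fingerprints` maps a document ID to an integer fingerprint.
    """
    pairs = []
    for (id_a, fp_a), (id_b, fp_b) in combinations(fingerprints.items(), 2):
        distance = hamming(fp_a, fp_b)
        if distance < threshold:
            pairs.append((id_a, id_b, distance))
    return pairs


# Toy 8-bit fingerprints (hypothetical values, not produced by the library).
fingerprints = {
    "doc_a": 0b10110010,
    "doc_b": 0b10110011,
    "doc_c": 0b01001101,
}
print(near_duplicate_pairs(fingerprints))  # → [('doc_a', 'doc_b', 1)]
```

Raising the threshold catches more loosely related items at the cost of more false positives, so it is worth tuning against a sample of your own data.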
