Simhash Python Library
The `simhash` library is a Python implementation of the Simhash algorithm, a locality-sensitive hashing technique for finding near-duplicate documents and comparing the similarity of two texts or other data objects. Similar inputs produce fingerprints that differ in only a few bits, so a similarity check reduces to a cheap bit comparison. This makes it useful for large-scale content deduplication, spam detection, and content recommendation. The current version is 2.1.2; releases are irregular and driven by contributions and bug fixes.
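To make the idea concrete, here is a minimal standard-library sketch of how a Simhash-style fingerprint can be computed. It is an illustration under stated assumptions, not the library's implementation: the names `simhash_sketch` and `hamming_distance` are hypothetical, features are plain word tokens, and MD5 is used only as a convenient stable hash.

```python
import hashlib
import re


def simhash_sketch(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide fingerprint of `text` (illustrative only, bits <= 128)."""
    # Assumption: features are lowercase word tokens. Real implementations
    # often use character n-grams or weighted terms instead.
    tokens = re.findall(r"\w+", text.lower())
    votes = [0] * bits
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            # Each feature votes +1 for a set bit and -1 for a clear bit.
            votes[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when its votes sum to a positive number.
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


a = simhash_sketch("The quick brown fox jumps over the lazy dog.")
b = simhash_sketch("A quick brown fox jumps over the sleeping dog.")
c = simhash_sketch("Python is a programming language.")
print(hamming_distance(a, b), hamming_distance(a, c))
```

Because every feature contributes one vote per bit, texts that share most of their features tend to agree on most bits, which is why a small Hamming distance signals similarity.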
Common errors
- ModuleNotFoundError: No module named 'simhash'
  cause The `simhash` library is not installed in the active Python environment.
  fix Run `pip install simhash` in the same environment (virtualenv or interpreter) that runs your code.
- TypeError: 'str' object is not callable
  cause A string was passed where the `Simhash` constructor expects a callable. In the 2.x releases of this package, the callable parameter is `hashfunc`; `f` is the fingerprint width in bits (an integer, 64 by default), not a feature-extraction hook.
  fix Pass a function to `hashfunc` and keep `f` an integer. To use custom features, extract them yourself and pass the resulting iterable as the first argument, e.g., `Simhash(my_feature_extractor_function(text))`.
- ValueError: input is empty or contains no features
  cause No features were generated from the input, so no hash can be computed. This happens with empty or whitespace-only text, or when a custom feature extraction step filters out every token.
  fix Check that the input has content before hashing. If you extract features yourself, make sure the extractor returns at least one feature for meaningful input; an overly aggressive tokenizer or stop-word filter can remove everything.
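One way to avoid the empty-feature case is to check the feature list before hashing. The sketch below uses only the standard library; `word_features` and `features_or_none` are hypothetical helpers, and the token rule (words longer than one character) is an arbitrary assumption chosen to show how a filter can empty the list.

```python
import re
from typing import List, Optional


def word_features(text: str) -> List[str]:
    """Hypothetical extractor: lowercase word tokens longer than one character."""
    return [t for t in re.findall(r"\w+", text.lower()) if len(t) > 1]


def features_or_none(text: str) -> Optional[List[str]]:
    """Return the feature list, or None when nothing survives tokenization."""
    features = word_features(text)
    return features if features else None


for sample in ["The quick brown fox", "  ...  ", "a b c"]:
    features = features_or_none(sample)
    if features is None:
        # Skip (or log) inputs that would yield no features to hash.
        print(f"skipping {sample!r}: no features to hash")
    else:
        print(f"{sample!r} -> {features}")
```

A non-empty list produced this way can then be handed to `Simhash` as its first argument; the package README shows the constructor accepting a list of features in place of a string.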
Warnings
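- gotcha The default feature extraction is generic and may not suit every kind of input (e.g., code, specific languages, structured data). In the 2.x releases it lowercases the text, keeps the characters matched by the `reg` pattern, and hashes overlapping 4-character windows of the result rather than whole words. For better accuracy on diverse content, extract features yourself and pass them to `Simhash` as an iterable instead of a raw string.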
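- gotcha Do not rely on `==` between `Simhash` objects to detect near-duplicates. Depending on the installed version, `==` compares either the fingerprint values (recent 2.x releases appear to define `__eq__`) or object identity, and in neither case does it tolerate small differences. Compare `hash1.value == hash2.value` for exact equality and use `distance()` for similarity.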
- gotcha The `distance()` method returns the Hamming distance: an integer counting the bits that differ between two hashes, from 0 up to the fingerprint width (64 bits by default). A *lower* distance indicates *higher* similarity, not the other way around. This can be counter-intuitive if you are used to similarity scores that range from 0 to 1.
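The last two gotchas can be illustrated on plain integers, which is what a fingerprint is underneath (in the library, the integer is available as the `.value` attribute of a `Simhash` object). This is a sketch: `hamming_distance` and `similarity` are hypothetical helpers, and the three fingerprints are hand-picked constants, not hashes of real text.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def similarity(a: int, b: int, bits: int = 64) -> float:
    """Map a Hamming distance onto a 0..1 score (1.0 means identical)."""
    return 1.0 - hamming_distance(a, b) / bits


fp1 = 0xF0F0F0F0F0F0F0F0
fp2 = 0xF0F0F0F0F0F0F0F1  # differs from fp1 in exactly one bit
fp3 = 0x0F0F0F0F0F0F0F0F  # bitwise complement of fp1 within 64 bits

print(fp1 == fp2)               # False: exact equality ignores "almost equal"
print(hamming_distance(fp1, fp2))  # 1
print(similarity(fp1, fp2))     # 0.984375
print(similarity(fp1, fp3))     # 0.0
```

The `bits` argument must match the fingerprint width used when the hashes were built; 64 is assumed here because it is the library default.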
Install
- pip install simhash
Imports
- Simhash
from simhash import Simhash
Quickstart
from simhash import Simhash
# Create Simhash objects from text
text1 = "The quick brown fox jumps over the lazy dog."
text2 = "A quick brown fox jumps over the sleeping dog."
text3 = "Python is a programming language."
hash1 = Simhash(text1)
hash2 = Simhash(text2)
hash3 = Simhash(text3)
# Calculate Hamming distance between hashes
# A lower distance means higher similarity
print(f"Distance between '{text1[:20]}...' and '{text2[:20]}...': {hash1.distance(hash2)}")
print(f"Distance between '{text1[:20]}...' and '{text3[:20]}...': {hash1.distance(hash3)}")
# Treat two texts as near-duplicates when their Hamming distance
# falls below a threshold (the right value depends on your data)
similarity_threshold = 3
if hash1.distance(hash2) < similarity_threshold:
    print(f"'{text1[:20]}...' and '{text2[:20]}...' are considered very similar.")
else:
    print(f"'{text1[:20]}...' and '{text2[:20]}...' are not considered very similar.")
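The same threshold test extends to a whole collection. The sketch below compares every pair of fingerprints and keeps those under the threshold; `near_duplicate_pairs` is a hypothetical helper, and the 8-bit integers are made-up stand-ins for real fingerprints (for example, `Simhash(text).value`).

```python
from itertools import combinations
from typing import Dict, List, Tuple


def near_duplicate_pairs(
    fingerprints: Dict[str, int], threshold: int = 3
) -> List[Tuple[str, str, int]]:
    """Return (id_a, id_b, distance) for every pair closer than `threshold`."""
    pairs = []
    for (id_a, fp_a), (id_b, fp_b) in combinations(sorted(fingerprints.items()), 2):
        distance = bin(fp_a ^ fp_b).count("1")  # Hamming distance
        if distance < threshold:
            pairs.append((id_a, id_b, distance))
    return pairs


# Illustrative fingerprints only; real ones would be 64-bit values.
fingerprints = {
    "doc1": 0b10110110,
    "doc2": 0b10110111,  # one bit away from doc1
    "doc3": 0b01001001,  # every bit differs from doc1
}
print(near_duplicate_pairs(fingerprints))  # [('doc1', 'doc2', 1)]
```

Pairwise comparison grows quadratically with the number of documents, so it only suits small collections. For larger ones, the package also ships a `SimhashIndex` class intended for near-duplicate lookup; check its documentation for the exact parameters.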