{"id":9306,"library":"simhash","title":"Simhash Python Library","description":"The `simhash` library provides a Python implementation of the Simhash Algorithm, a technique for quickly finding near-duplicate documents or comparing the similarity of two texts or data objects. It's highly useful for tasks like large-scale content deduplication, spam detection, and content recommendation, offering a fast way to identify perceptually similar items. The current version is 2.1.2, and it follows an irregular release cadence based on contributions and bug fixes.","status":"active","version":"2.1.2","language":"en","source_language":"en","source_url":"https://github.com/1e0ng/simhash","tags":["hashing","similarity","deduplication","fuzzy matching","information retrieval","text processing"],"install":[{"cmd":"pip install simhash","lang":"bash","label":"Install stable version"}],"dependencies":[],"imports":[{"symbol":"Simhash","correct":"from simhash import Simhash"}],"quickstart":{"code":"from simhash import Simhash\n\n# Create Simhash objects from text\ntext1 = \"The quick brown fox jumps over the lazy dog.\"\ntext2 = \"A quick brown fox jumps over the sleeping dog.\"\ntext3 = \"Python is a programming language.\"\n\nhash1 = Simhash(text1)\nhash2 = Simhash(text2)\nhash3 = Simhash(text3)\n\n# Calculate Hamming distance between hashes\n# A lower distance means higher similarity\nprint(f\"Distance between '{text1[:20]}...' and '{text2[:20]}...': {hash1.distance(hash2)}\")\nprint(f\"Distance between '{text1[:20]}...' and '{text3[:20]}...': {hash1.distance(hash3)}\")\n\n# You can use a similarity threshold to determine if items are 'duplicates'\nsimilarity_threshold = 3\nif hash1.distance(hash2) < similarity_threshold:\n    print(f\"'{text1[:20]}...' and '{text2[:20]}...' are considered very similar.\")\nelse:\n    print(f\"'{text1[:20]}...' and '{text2[:20]}...' 
are not considered very similar.\")","lang":"python","description":"This quickstart creates `Simhash` objects from strings and computes the Hamming distance between them. A smaller distance indicates greater similarity. The right similarity threshold depends on your application."},"warnings":[{"fix":"Tokenize the text yourself and pass the resulting iterable of features (strings, or `(feature, weight)` tuples) as the first argument, e.g. `Simhash(my_tokenizer(text))`. Keep `f` as an integer bit width that is a multiple of 8.","message":"The `f` parameter is the fingerprint width in bits (default 64), not a feature-extraction function. When given a string, `Simhash` lowercases it, keeps only the characters matched by the `reg` pattern, and hashes overlapping 4-character shingles rather than whole words. This default may not suit all content (e.g., code, specific languages, structured data); for better accuracy, supply your own features.","severity":"gotcha","affected_versions":"All"},{"fix":"To compare hash values unambiguously, use `simhash_obj_a.value == simhash_obj_b.value`. To measure similarity, use `simhash_obj_a.distance(simhash_obj_b)`, which returns the Hamming distance.","message":"Do not rely on `==` between `Simhash` objects without checking the installed version. Recent releases appear to define `__eq__` to compare the underlying `.value`; where that method is absent, Python falls back to object identity and returns `False` for distinct objects. In either case, equality only detects exact hash matches, not near-duplicates.","severity":"gotcha","affected_versions":"All"},{"fix":"Interpret `distance(other)` results carefully: 0 means identical hashes, and larger values mean less similar inputs. Choose a distance threshold that matches your application's definition of 'similar' (e.g., distance < 3).","message":"The `distance()` method returns the Hamming distance: an integer count of the bits that differ between two hashes. A *lower* distance indicates *higher* similarity, not the other way around. 
This can be counter-intuitive compared to similarity scores that range from 0 to 1.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Run `pip install simhash` in your terminal to install the library.","cause":"The `simhash` library has not been installed in your Python environment.","error":"ModuleNotFoundError: No module named 'simhash'"},{"fix":"Pass a function for `hashfunc`, or omit it to use the default. Pass an integer multiple of 8 for `f`, and supply custom features through the first argument rather than through a keyword parameter.","cause":"A string was passed where the `Simhash` constructor expects a callable. The callable parameter is `hashfunc`; `f` is an integer bit width, not a feature-extraction function.","error":"TypeError: 'str' object is not callable"},{"fix":"Check that the input is not empty. If you pass pre-extracted features, make sure your tokenizer returns at least one feature for meaningful inputs; an overly aggressive tokenizer or `reg` pattern can strip everything.","cause":"The input text or the supplied feature list produced no features, so no hash could be computed.","error":"ValueError: input is empty or contains no features"}]}