py-tlsh

4.12.1 verified Fri May 01 auth: no python

TLSH (Trend Micro Locality Sensitive Hash) is a fuzzy hashing algorithm for measuring how similar two pieces of binary data are. The py-tlsh package is a Python extension, written in C++, for computing and comparing TLSH digests. The current version is 4.12.1; a major v5.0.0 is described in the warnings. Releases are irregular.

pip install py-tlsh
error TypeError: a bytes-like object is required, not 'str'
cause Passed a string to tlsh.hash() instead of bytes.
fix
Encode the string to bytes: tlsh.hash('hello'.encode('utf-8'))
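error ModuleNotFoundError: No module named 'tlsh' (ImportError: No module named tlsh on older interpreters)
cause py-tlsh is not installed in the active Python environment, or its build failed during installation.
fix
Install it with pip install py-tlsh. If the build fails, make sure a C++ compiler is available (build-essential on Linux, Xcode command line tools on macOS).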
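Before reinstalling, it can help to confirm whether the interpreter you are running can see the module at all. The following is a minimal sketch using only the standard library; the helper name `tlsh_available` is hypothetical and not part of py-tlsh.

```python
import importlib.util


def tlsh_available() -> bool:
    """Return True if a top-level module named 'tlsh' can be located."""
    return importlib.util.find_spec("tlsh") is not None


if tlsh_available():
    import tlsh  # installed by the py-tlsh distribution
else:
    # The module is missing from this environment; install it first.
    print("tlsh not found in this environment; try: pip install py-tlsh")
```

If this reports the module as missing even after a successful install, pip and the interpreter are probably pointing at different environments.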
error ValueError: Invalid TLSH hash
cause Provided hash string is not a valid TLSH digest (wrong length or characters).
fix
Ensure the digest is 70 hex characters, optionally preceded by the 'T1' version prefix (72 characters in total). Which form you get depends on the TLSH version that produced it; see the warnings. Use tlsh.is_valid(hash) to check.
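The length and prefix rules in this fix can also be checked without calling into the extension, for example when filtering digests read from a file. This is a sketch under the assumption that a digest is 70 hex characters with an optional 'T1' prefix; `looks_like_tlsh` is a hypothetical helper, and it checks format only, not whether the digest is meaningful.

```python
import re

# Assumption: 70 hex characters, optionally preceded by the "T1" prefix.
_TLSH_PATTERN = re.compile(r"(?:T1)?[0-9A-Fa-f]{70}")


def looks_like_tlsh(digest) -> bool:
    """Return True if digest has the shape of a TLSH digest string."""
    return isinstance(digest, str) and _TLSH_PATTERN.fullmatch(digest) is not None


print(looks_like_tlsh("T1" + "0" * 70))  # prefixed form
print(looks_like_tlsh("A" * 70))         # unprefixed form
print(looks_like_tlsh("not-a-digest"))   # wrong length and characters
```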
error OSError: [Errno 2] No such file or directory: 'libtlsh.so'
cause Dynamic library not found. Usually occurs when building from source or using a non-standard installation.
fix
Reinstall py-tlsh via pip (it bundles the C extension). If using a custom build, set LD_LIBRARY_PATH or install the library system-wide.
breaking TLSH v5.0.0 changed the default digest format to start with 'T1'. The v4.x series is described here as not including the 'T1' prefix. If you upgrade to v5.0.0 (not yet on PyPI as of this writing), your new hashes will not match stored v4.x hashes as strings, and other tools that expect the old format may reject them.
fix Use v4.12.1 (the latest PyPI release) unless you explicitly need v5 features. If you use v5, expect every hash to start with 'T1' and update your storage and comparison logic accordingly.
deprecated The function `tlsh.hash()` returns the digest as a hex string. Under v5.0.0 that string is expected to carry the 'T1' prefix; the unprefixed form is deprecated and will be removed in a future major release.
fix For forward compatibility, accept digests with or without the 'T1' prefix and normalize them before storing or comparing them as strings. For now, stick with v4.12.1 if you want to avoid breaking changes.
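One way to handle the prefix is to normalize digests to a single form before using them as keys or comparing them as strings. The sketch below is an assumption about how that might look; `strip_t1` and `same_digest` are hypothetical helpers, not py-tlsh functions. Since which form a given build emits is easy to get wrong, checking at runtime is safer than assuming.

```python
def strip_t1(digest: str) -> str:
    """Return the digest body without a leading 'T1', upper-cased."""
    # 'T' is not a hex character, so an unprefixed digest never starts with "T1".
    body = digest[2:] if digest.startswith("T1") else digest
    return body.upper()


def same_digest(a: str, b: str) -> bool:
    """Compare two digest strings, ignoring the 'T1' prefix and letter case."""
    return strip_t1(a) == strip_t1(b)


stored = "0F" * 35          # unprefixed form, 70 hex characters
fresh = "T1" + "0f" * 35    # prefixed form of the same body
print(same_digest(stored, fresh))
```

This is meant for string equality and deduplication only; pass digests to `tlsh.diff()` in whatever form your installed version produces.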
gotcha The `tlsh.diff()` function returns a difference score: 0 means identical, and higher values mean more different. This is the opposite of a similarity score (other libraries typically report 0-100, where higher means more similar). A common mistake is to treat the score as a similarity percentage.
fix Use `tlsh.diff(hash1, hash2)` and interpret 0 as identical. If you need a similarity metric, you can invert the score (e.g., similarity = max(0, 100 - diff)), but note that the maximum difference is not fixed at 100, so large differences all collapse to 0.
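The inversion suggested above can be wrapped in a small function with an explicit cutoff. This is only a sketch: `diff_to_similarity` and its `cap` parameter are hypothetical, and the default of 100 simply reproduces the max(0, 100 - diff) formula. A different cap is a tuning choice for your data, not something defined by TLSH.

```python
def diff_to_similarity(diff: int, cap: int = 100) -> float:
    """Map a difference score to a 0-100 similarity value.

    Scores at or above cap map to 0.0; a score of 0 maps to 100.0.
    """
    if diff < 0 or cap <= 0:
        raise ValueError("diff must be >= 0 and cap must be > 0")
    clipped = min(diff, cap)
    return 100.0 * (cap - clipped) / cap


print(diff_to_similarity(0))             # identical inputs
print(diff_to_similarity(30))            # small difference
print(diff_to_similarity(150, cap=300))  # wider scale, chosen by the caller
```

Keeping the raw difference score alongside any derived percentage is usually worthwhile, since the clipping discards information about how far apart very different inputs are.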
gotcha The `tlsh.hash()` function requires a bytes-like object, not a string. Passing a plain string will raise a TypeError.
fix Pass bytes: `tlsh.hash(b'hello')` or `tlsh.hash('hello'.encode('utf-8'))`.
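When input can arrive as either text or bytes, a small conversion step in front of `tlsh.hash()` avoids the TypeError. The helper below is a hypothetical sketch, not part of py-tlsh, and the UTF-8 default is an assumption; use whatever encoding your data actually has, since different encodings give different bytes and therefore different hashes.

```python
def to_hashable_bytes(value, encoding: str = "utf-8") -> bytes:
    """Return value as bytes, encoding str input with the given encoding."""
    if isinstance(value, str):
        return value.encode(encoding)
    if isinstance(value, (bytes, bytearray, memoryview)):
        return bytes(value)
    raise TypeError(f"expected str or bytes-like, got {type(value).__name__}")


payload = to_hashable_bytes("hello")
print(payload)
# The result can then be passed on, e.g. tlsh.hash(payload)
```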
gotcha The py-tlsh extension is compiled from C++. On some platforms (especially Windows and older Linux distros) installation may fail due to missing C++ compiler or headers. The PyPI wheel may not cover all platforms.
fix Install a C++ compiler (e.g., 'build-essential' on Ubuntu, Xcode on macOS, Visual Studio Build Tools on Windows). For Windows, consider using the unofficial Windows binary wheels from Christoph Gohlke or install via conda.

Basic usage: hash creation, comparison, and validation.

import tlsh

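# Create a TLSH hash from a byte string. The input needs some length and
# variety: very short inputs such as b"hello world" may produce the
# placeholder "TNULL" instead of a digest (TLSH needs roughly 50 bytes or more).
data = "\n".join(f"record {i}: sample payload" for i in range(40)).encode("utf-8")
digest = tlsh.hash(data)  # "digest" avoids shadowing the built-in hash()
print("Hash:", digest)

# Compare two hashes. diff() is a distance, not a similarity:
# 0 means identical, larger values mean more different.
digest2 = tlsh.hash(data + b"\nrecord 40: one extra line")
score = tlsh.diff(digest, digest2)
print("Difference score:", score)

# Check if a hash is valid. tlsh.is_valid() is used as listed on this page;
# the getattr guard keeps the script running if your build does not expose it.
is_valid = getattr(tlsh, "is_valid", None)
if is_valid is not None:
    print("Is valid:", is_valid(digest))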