ClickHouse CityHash Bindings
clickhouse-cityhash provides Python bindings for CityHash v1.0.2, an older version of Google's CityHash algorithm. Its main purpose is compatibility with ClickHouse servers, which use this CityHash version internally for several hashing operations, including hashing data in the ClickHouse protocol. It is a fork of the more general `python-cityhash` library, tailored to the ClickHouse ecosystem. The current version is 1.0.2.5, and it receives compatibility and bug-fix updates.
Common errors
- TypeError: argument 1 must be bytes, not str
  cause: Attempting to hash a Python `str` object directly with `CityHash64` or `CityHash128`.
  fix: Encode the string to bytes first: `CityHash64(my_string.encode('utf-8'))`.
- Hash mismatch between Python application and ClickHouse server for the same input.
  cause: Using a different CityHash implementation or version (e.g., the more general `python-cityhash` library), which produces different hashes than ClickHouse's internal v1.0.2 CityHash.
  fix: Use `clickhouse-cityhash` to hash data intended for ClickHouse, because it implements the compatible CityHash v1.0.2 algorithm.
- ImportError: cannot import name 'CityHash64' from 'clickhouse_cityhash' (...)
  cause: Trying to import `CityHash64` or `CityHash128` directly from the top-level `clickhouse_cityhash` package.
  fix: The functions are exposed by the `cityhash` submodule of that package. Use `from clickhouse_cityhash.cityhash import CityHash64`.
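The `TypeError` above can be avoided by normalizing input before it reaches the hash function. The following is a minimal sketch, not part of the library: `ensure_bytes` is a hypothetical helper name, and UTF-8 is an assumed default encoding that must match whatever produced the bytes on the other side of the comparison.

```python
def ensure_bytes(value, encoding="utf-8"):
    """Return `value` as bytes, encoding text explicitly.

    Hypothetical helper: the name and the UTF-8 default are assumptions,
    not part of clickhouse-cityhash.
    """
    if isinstance(value, str):
        # Text has no single byte representation, so pick the encoding explicitly.
        return value.encode(encoding)
    if isinstance(value, (bytes, bytearray, memoryview)):
        # Copy bytes-like input into an immutable bytes object.
        return bytes(value)
    raise TypeError(f"expected str or bytes-like, got {type(value).__name__}")


print(ensure_bytes("hello world"))   # b'hello world'
print(ensure_bytes(b"hello world"))  # b'hello world'
```

The result can then be passed to `CityHash64` or `CityHash128`. Rejecting other types outright is a deliberate choice here: silently stringifying arbitrary objects would produce hashes that are hard to reproduce.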
Warnings
- breaking This library implements CityHash v1.0.2. Newer versions of CityHash (and the general `python-cityhash` library) produce different hash values. Use this library for compatibility with ClickHouse's internal hashing, not as a general-purpose binding for the latest CityHash.
- gotcha CityHash is a *non-cryptographic* hash function. It is optimized for speed and good distribution, but it is not designed to be collision-resistant against malicious input. Do NOT use it for security-sensitive applications like password storage, digital signatures, or integrity checks where adversarial input is possible.
- gotcha CityHash functions operate strictly on byte strings (`bytes`), not Python unicode strings (`str`). Passing a `str` directly raises a `TypeError` (see Common errors). Encode text explicitly, for example with `.encode('utf-8')`, so the bytes being hashed are unambiguous.
- gotcha This implementation of CityHash does not support incremental hashing. It is not suitable for hashing long data streams or data that arrives in chunks, as the entire input must be provided at once.
- gotcha When hashing integers, convert them to a fixed-size byte representation for consistent and reproducible results across different environments or Python versions. Variable-length byte representations can lead to inconsistent hashes.
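The last two warnings can be illustrated with a short sketch. Both helper names (`int_to_fixed_bytes`, `hash_chunks`) are hypothetical, and `hash_fn` is a parameter standing for whichever one-shot hash function is in use, such as this library's `CityHash64`. `zlib.crc32` appears only as a stand-in so the sketch runs without the package installed; it is not a substitute for CityHash.

```python
import zlib


def int_to_fixed_bytes(value, width=8, byteorder="big", signed=False):
    """Encode an integer with a fixed width so equal values give equal bytes.

    The width and byte order are assumptions of this sketch; they must
    match on every side that computes or compares the hash.
    """
    return value.to_bytes(width, byteorder, signed=signed)


def hash_chunks(chunks, hash_fn):
    """Hash data that arrived in pieces with a one-shot hash function.

    There is no incremental API, so the chunks are joined into a single
    bytes object first. The whole input is held in memory.
    """
    return hash_fn(b"".join(chunks))


# Stand-in hash function for demonstration only.
whole = zlib.crc32(b"hello world")
pieces = hash_chunks([b"hello ", b"world"], zlib.crc32)
print(whole == pieces)  # True

print(int_to_fixed_bytes(123456789).hex())            # 00000000075bcd15
print(int_to_fixed_bytes(123456789, byteorder="little").hex())  # 15cd5b0700000000
```

Because the chunks are buffered in full, this approach only suits inputs that fit comfortably in memory.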
Install
- pip install clickhouse-cityhash
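Because `python-cityhash` and `clickhouse-cityhash` are easy to confuse, it can help to confirm which distribution is actually installed. The sketch below uses the standard library's `importlib.metadata`; `installed_version` is a hypothetical helper name, and the only name taken from this page is the distribution name `clickhouse-cityhash`.

```python
from importlib import metadata


def installed_version(dist_name="clickhouse-cityhash"):
    """Return the installed version string of `dist_name`, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("clickhouse-cityhash is not installed")
else:
    print(f"clickhouse-cityhash {version} is installed")
```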
Imports
- CityHash64
  wrong: import clickhouse_cityhash; clickhouse_cityhash.CityHash64(...)
  correct: from clickhouse_cityhash.cityhash import CityHash64
- CityHash128
  wrong: import clickhouse_cityhash; clickhouse_cityhash.CityHash128(...)
  correct: from clickhouse_cityhash.cityhash import CityHash128
Quickstart
from clickhouse_cityhash.cityhash import CityHash64, CityHash128
data_string = 'hello world'
data_bytes = data_string.encode('utf-8')
hash64 = CityHash64(data_bytes)
hash128 = CityHash128(data_bytes)
print(f"CityHash64 for '{data_string}': {hash64}")
print(f"CityHash128 for '{data_string}': {hash128}")
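# Hashing an integer: convert it to a fixed-width byte string first so results are reproducible
integer_data = 123456789
# 8-byte big-endian encoding; the width and byte order are choices that must match wherever the hash is compared
integer_bytes = integer_data.to_bytes(8, 'big')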
hash64_int = CityHash64(integer_bytes)
print(f"CityHash64 for integer {integer_data}: {hash64_int}")
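A 128-bit hash is often easier to store or compare as two 64-bit halves or as a fixed-width hex string. The sketch below assumes that `CityHash128` returns a single non-negative Python `int`, which this page does not state explicitly. `split_uint128` and `to_hex128` are hypothetical helper names, and the sketch makes no claim about which half corresponds to which part of the underlying C++ value.

```python
MASK64 = (1 << 64) - 1


def split_uint128(value):
    """Split a 128-bit unsigned integer into (high 64 bits, low 64 bits)."""
    if not 0 <= value < (1 << 128):
        raise ValueError("value does not fit in 128 unsigned bits")
    return value >> 64, value & MASK64


def to_hex128(value):
    """Format a 128-bit unsigned integer as 32 zero-padded hex digits."""
    if not 0 <= value < (1 << 128):
        raise ValueError("value does not fit in 128 unsigned bits")
    return f"{value:032x}"


example = (1 << 64) + 2  # Illustrative value, not a real hash
print(split_uint128(example))  # (1, 2)
print(to_hex128(255))          # 000000000000000000000000000000ff
```

In the Quickstart, `hash128` could be passed to either helper if the assumption about the return type holds.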