Cython Hash Table for Pre-Hashed Keys
preshed is a high-performance Cython library for Python that provides efficient hash table data structures. It's designed for use cases where keys are already pre-hashed, offering `PreshMap` for key-value storage, `PreshCounter` for frequency counting, and `BloomFilter` for probabilistic set membership testing. Maintained by Explosion (the creators of spaCy), it sees regular updates primarily for Python version compatibility and performance enhancements, with occasional major releases introducing significant architectural changes.
Warnings
- breaking Version 4.0.0 introduced significant internal architectural changes, replacing raw arrays and pointers with `std::vector` and `std::unique_ptr` for `BloomFilter`, `PreshMap`, and `PreshCounter` implementations, and removing `PreshMapArray`. This affects users interacting with the C API or relying on specific internal memory layouts.
- breaking Version 2.0.0 introduced a hard dependency on `cymem>=2.0.0`. Projects using an older version of `cymem` (e.g., `cymem<2.0.0`) would face dependency conflicts.
- gotcha The library is explicitly designed for 'pre-hashed' keys (uint64_t values). Feeding non-hashed or poorly hashed data directly into `PreshMap` or `PreshCounter` without proper pre-hashing can lead to suboptimal performance and hash collisions, negating the library's benefits.
- gotcha While Python APIs for `BloomFilter` and `PreshMap` are thread-safe on Python 3.14+ (including free-threaded builds), the C API and `PreshCounter` class require external synchronization if used in a multithreaded environment to prevent race conditions and data corruption.
Install
-
pip install preshed --only-binary preshed -
conda install -c conda-forge preshed
Imports
- PreshMap
from preshed.maps import PreshMap
- BloomFilter
from preshed.bloom import BloomFilter
- PreshCounter
from preshed.counter import PreshCounter
Quickstart
from preshed.maps import PreshMap
# PreshMap expects uint64 keys and values
my_map = PreshMap(initial_size=1024) # Initial size should be a power of 2
# Simulate pre-hashed keys (e.g., using murmurhash)
key1 = 1234567890123456789 # Example uint64
key2 = 9876543210987654321
my_map[key1] = 100
my_map[key2] = 200
print(f"Value for key1: {my_map[key1]}") # Expected: 100
print(f"Value for key2: {my_map[key2]}") # Expected: 200
print(f"Is key1 in map: {key1 in my_map}") # Expected: True
# Test a missing key
missing_key = 1111111111111111111
print(f"Value for missing_key: {my_map[missing_key]}") # Expected: None
# Remove a key
del my_map[key1]
print(f"Is key1 in map after deletion: {key1 in my_map}") # Expected: False