Cython Hash Table for Pre-Hashed Keys

3.0.13 verified Tue May 12 auth: no python install: verified

preshed is a Cython library for Python that provides fast hash table data structures for keys that have already been hashed to 64-bit integers. It offers `PreshMap` for key-value storage, `PreshCounter` for frequency counting, and `BloomFilter` for probabilistic set membership testing. Maintained by Explosion (the creators of spaCy), it is updated regularly, mainly for Python version compatibility and performance, with occasional major releases that change its internal architecture.

pip install preshed --only-binary preshed
error ModuleNotFoundError: No module named 'preshed.bloom'
cause The 'preshed' library or one of its specific submodules (like 'bloom') was not correctly installed, or there is a version incompatibility with a dependent library such as spaCy.
fix
Upgrade or reinstall the 'preshed' library, preferably in a clean virtual environment. If it's a dependency, try upgrading the main package (e.g., pip install --upgrade spacy preshed).
error error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
cause preshed is a Cython library, so building it from source requires a C compiler (such as GCC) and the Python development headers. pip falls back to a source build when no pre-compiled wheel is available for your Python version and OS; the build then fails if the compiler or the headers are missing.
fix
Install the necessary C compiler and Python development headers for your system (e.g., sudo apt-get install build-essential python3-dev on Debian/Ubuntu, or sudo yum install gcc python3-devel on CentOS/RHEL).
error DistutilsPlatformError: Unable to find vcvarsall.bat
cause On Windows, building Cython extensions like preshed requires the Microsoft Visual C++ Build Tools. This error means the build tools are not installed or could not be located by the build system.
fix
Install the appropriate Microsoft Visual C++ Build Tools (available from the Visual Studio Community edition or as a standalone Build Tools installer) and ensure they are correctly configured for your Python installation.
error Failed building wheel for preshed
cause This general error occurs when pip attempts to build preshed from source (because a pre-compiled wheel is unavailable or incompatible), and the compilation process fails due to missing compiler tools, incompatible Python/Cython versions, or other build environment issues.
fix
Ensure you have the necessary C/C++ compiler installed for your operating system (see compiler-specific fixes above). Also, update pip and setuptools (pip install --upgrade pip setuptools). Alternatively, consider using conda install -c conda-forge preshed which often provides pre-built binaries.
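Before reinstalling, it can help to confirm which preshed modules Python can actually locate. The following is a minimal diagnostic sketch using only the standard library. The helper name `is_importable` is hypothetical, and the module list assumes the usual layout (`preshed.maps`, `preshed.counter`, `preshed.bloom`). Locating a module does not prove that the compiled extension loads, so treat a MISSING result as the useful signal.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the module, without raising."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError,
        # which is a subclass of ImportError.
        return False


# Module names are assumptions about the package layout.
for name in ("preshed", "preshed.maps", "preshed.counter", "preshed.bloom"):
    status = "ok" if is_importable(name) else "MISSING"
    print(f"{name}: {status}")
```

If only a submodule is reported missing while `preshed` itself is found, a partial or mismatched install is the likely cause, and a reinstall in a clean virtual environment is the fix described above.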
breaking Version 4.0.0 introduced significant internal architectural changes, replacing raw arrays and pointers with `std::vector` and `std::unique_ptr` for `BloomFilter`, `PreshMap`, and `PreshCounter` implementations, and removing `PreshMapArray`. This affects users interacting with the C API or relying on specific internal memory layouts.
fix Review your code for direct C API interactions or assumptions about internal memory management. Adapt to the new C++-backed structures. For Python users, this should mostly be an internal change, but retesting is recommended.
breaking Version 2.0.0 introduced a hard dependency on `cymem>=2.0.0`. Projects using an older version of `cymem` (e.g., `cymem<2.0.0`) would face dependency conflicts.
fix Ensure your project's `cymem` dependency is updated to `cymem>=2.0.0`. If you have other dependencies pinning an older `cymem`, you may need to update those packages or manage your dependency tree carefully.
gotcha The library is explicitly designed for 'pre-hashed' keys (uint64_t values). Feeding non-hashed or poorly hashed data directly into `PreshMap` or `PreshCounter` without proper pre-hashing can lead to suboptimal performance and hash collisions, negating the library's benefits.
fix Always pre-hash your keys into `uint64` integers using a robust hashing algorithm (e.g., `murmurhash`) before using them with `preshed` data structures. The library assumes keys are already randomized.
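The sketch below illustrates why pre-hashing matters. It assumes a table that picks a slot from the low bits of the key (`key & (size - 1)`), which is a common open-addressing scheme and is treated here as an assumption about how such a table behaves, not a statement of preshed's exact internals. The `prehash` helper is hypothetical and uses BLAKE2b from the standard library as a stand-in for a dedicated hash such as MurmurHash.

```python
import hashlib


def prehash(text: str) -> int:
    """Map a string to a 64-bit unsigned integer (BLAKE2b stand-in)."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def occupied_slots(keys, table_size=1024):
    """Count distinct slots hit when the low bits of each key pick the slot."""
    mask = table_size - 1  # table_size is assumed to be a power of two
    return len({key & mask for key in keys})


# Structured, un-hashed keys: multiples of 1024 share the same low 10 bits.
raw_keys = [i * 1024 for i in range(500)]
hashed_keys = [prehash(str(key)) for key in raw_keys]

print(occupied_slots(raw_keys))     # 1: every key lands in slot 0
print(occupied_slots(hashed_keys))  # hashed keys spread over many slots
```

With raw keys every insert collides in a single slot, while the hashed keys are distributed across the table. The resulting integers are in the uint64 range and can be used as keys directly.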
gotcha While Python APIs for `BloomFilter` and `PreshMap` are thread-safe on Python 3.14+ (including free-threaded builds), the C API and `PreshCounter` class require external synchronization if used in a multithreaded environment to prevent race conditions and data corruption.
fix For multithreaded Python applications, ensure you understand the thread-safety guarantees of each `preshed` class. Use external locking mechanisms (e.g., `threading.Lock`) for `PreshCounter` and direct C API calls in concurrent contexts.
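A minimal sketch of the external-locking approach follows. To stay runnable without preshed installed, it wraps a hypothetical stand-in class, `DictCounter`, that exposes `inc(key, amount)` and item lookup. It is an assumption that `PreshCounter` can be wrapped the same way through an `inc` method of this shape. `LockedCounter` is likewise an illustrative name, not part of the library.

```python
import threading


class DictCounter:
    """Hypothetical stand-in with an inc(key, amount) and [] interface."""

    def __init__(self):
        self._counts = {}

    def inc(self, key, amount):
        self._counts[key] = self._counts.get(key, 0) + amount

    def __getitem__(self, key):
        return self._counts.get(key, 0)


class LockedCounter:
    """Serialize every call into the wrapped counter with a single lock."""

    def __init__(self, counter):
        self._counter = counter
        self._lock = threading.Lock()

    def inc(self, key, amount=1):
        with self._lock:
            self._counter.inc(key, amount)

    def __getitem__(self, key):
        with self._lock:
            return self._counter[key]


def hammer(counter, key, times):
    """Increment one key repeatedly from a worker thread."""
    for _ in range(times):
        counter.inc(key, 1)


shared = LockedCounter(DictCounter())
threads = [
    threading.Thread(target=hammer, args=(shared, 42, 1000)) for _ in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(shared[42])  # 4000
```

A single coarse lock is the simplest correct choice. If contention becomes a problem, one option is to keep a separate counter per thread and merge the results at the end.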
conda install -c conda-forge preshed
| python | os / libc | status | wheel | install | import | disk |
| 3.10 | alpine (musl) | | wheel | - | 0.00s | 24.4M |
| 3.10 | alpine (musl) | | - | - | 0.00s | 24.4M |
| 3.10 | slim (glibc) | | wheel | 1.8s | 0.00s | 22M |
| 3.10 | slim (glibc) | | - | - | 0.01s | 22M |
| 3.11 | alpine (musl) | | wheel | - | 0.00s | 26.4M |
| 3.11 | alpine (musl) | | - | - | 0.01s | 26.4M |
| 3.11 | slim (glibc) | | wheel | 1.8s | 0.00s | 24M |
| 3.11 | slim (glibc) | | - | - | 0.00s | 24M |
| 3.12 | alpine (musl) | | wheel | - | 0.00s | 18.4M |
| 3.12 | alpine (musl) | | - | - | 0.00s | 18.4M |
| 3.12 | slim (glibc) | | wheel | 1.6s | 0.00s | 16M |
| 3.12 | slim (glibc) | | - | - | 0.01s | 16M |
| 3.13 | alpine (musl) | | wheel | - | 0.00s | 18.1M |
| 3.13 | alpine (musl) | | - | - | 0.00s | 18.0M |
| 3.13 | slim (glibc) | | wheel | 1.6s | 0.00s | 16M |
| 3.13 | slim (glibc) | | - | - | 0.01s | 16M |
| 3.9 | alpine (musl) | | wheel | - | 0.00s | 23.9M |
| 3.9 | alpine (musl) | | - | - | 0.00s | 23.9M |
| 3.9 | slim (glibc) | | wheel | 2.0s | 0.00s | 22M |
| 3.9 | slim (glibc) | | - | - | 0.01s | 22M |

Demonstrates the basic usage of PreshMap, including initialization, setting and getting items, membership testing, and deletion. Keys are expected to be 64-bit unsigned integers.

from preshed.maps import PreshMap

# PreshMap expects uint64 keys and values
my_map = PreshMap(initial_size=1024) # Initial size should be a power of 2

# Simulate pre-hashed keys (e.g., using murmurhash)
key1 = 1234567890123456789 # Example uint64
key2 = 9876543210987654321

my_map[key1] = 100
my_map[key2] = 200

print(f"Value for key1: {my_map[key1]}") # Expected: 100
print(f"Value for key2: {my_map[key2]}") # Expected: 200
print(f"Is key1 in map: {key1 in my_map}") # Expected: True

# Test a missing key
missing_key = 1111111111111111111
print(f"Value for missing_key: {my_map[missing_key]}") # Expected: None

# Remove a key
del my_map[key1]
print(f"Is key1 in map after deletion: {key1 in my_map}") # Expected: False