Python Hyperscan
Python bindings for Hyperscan. This library provides high-performance regular expression matching for large-scale pattern matching tasks, including multi-pattern and streaming modes. The current release is 0.8.2. Development is active, with several patch and minor releases typically appearing within a few months.
Common errors
-
No module named 'hyperscan._hyperscan'
cause The underlying C extension for hyperscan failed to build or was not correctly installed/linked, or there's a version mismatch between the Python bindings and the underlying Hyperscan/Vectorscan C library if building from source.
fix Ensure `pip install hyperscan` completes without errors. If building from source, verify all C/C++ build prerequisites (CMake, C/C++ toolchain, Ragel) are met. For older library versions, ensure the system's Hyperscan library matches the expected version (e.g., `v0.1.5` needed Hyperscan `v4.x`, while `v0.2+` needs `v5.x`). Confirm your Python virtual environment is active.
-
Illegal instruction (core dumped) or segmentation fault on `import hyperscan`
cause Hyperscan binaries (especially wheels) can be compiled with CPU-specific optimizations (like AVX instructions). If these instructions are not supported by the CPU where the library is being imported, it can lead to an 'Illegal instruction' crash.
fix If using pre-built wheels, try installing a more generic wheel if available, or ensure your CPU supports the instruction sets used. If building from source, set `CMAKE_ARGS="-DUSE_CPU_NATIVE=OFF"` during `pip install .` to disable CPU-native optimizations, or compile on the target machine.
-
hyperscan.error: ScratchInUseError('error code -10')
cause Hyperscan's scratch space (`hyperscan.Scratch` object) is not thread-safe and cannot be used concurrently by multiple scanning operations without proper management.
fix Allocate a separate `hyperscan.Scratch` object for each concurrent scan operation or thread. Ensure each `db.scan()` or `stream.scan()` call receives its own dedicated `scratch` instance.
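One way to follow the ScratchInUseError fix above is to cache one scratch object per thread. The sketch below uses only the standard library; `PerThread` is a hypothetical helper name, and a plain `object()` stands in for the scratch so the example runs without Hyperscan installed. With the library, the factory would presumably be `lambda: hyperscan.Scratch(db)`.

```python
import threading


class PerThread:
    """Lazily create and cache one object per thread (hypothetical helper)."""

    def __init__(self, factory):
        self._factory = factory          # called once in each thread that asks
        self._local = threading.local()  # attribute storage private to each thread

    def get(self):
        try:
            return self._local.value
        except AttributeError:
            self._local.value = self._factory()
            return self._local.value


# Assumption: with Hyperscan this would read
#     scratches = PerThread(lambda: hyperscan.Scratch(db))
# and each worker would call db.scan(..., scratch=scratches.get()).
scratches = PerThread(object)

seen = []  # one entry per worker thread


def worker():
    seen.append(scratches.get())


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Every thread received its own instance.
print(len({id(s) for s in seen}))
```

Repeated calls to `get()` within one thread return the same object, so the allocation cost is paid once per thread rather than once per scan.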
Warnings
- breaking Starting in v0.7.9, the build system's migration to CMake changed how PCRE is linked: PCRE was built from source without UTF-8 support enabled. As a result, valid Unicode patterns can fail with 'Expression is not valid UTF-8' errors, which broke code that worked in v0.7.8 and earlier.
- gotcha Hyperscan does not support capturing sub-expressions (capture groups). If you need to extract specific parts of a matched string, you will need a two-stage approach: use Hyperscan for high-performance identification, and then a standard regex engine (like Python's `re` module) for detailed extraction on the matched segments.
- gotcha The `scan` methods (block and stream) do not strictly require a `match_event_handler` callback. If no handler is provided, match production is entirely suppressed, meaning you won't get any results. This can be misleading if you expect a return value representing matches.
- gotcha Using `HS_FLAG_SOM_LEFTMOST` to obtain the leftmost start offset of a match (Start Of Match) can significantly impact performance and reduce the range of patterns that Hyperscan can compile, potentially leading to 'Pattern too large' errors.
- gotcha Versions prior to 0.8.2 had a bug that could truncate match offsets when scanning data buffers larger than 4GB, producing inaccurate match positions.
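The capture-group and callback gotchas above can be handled together: a callback object records the hits, and Python's `re` then extracts groups from only the matched spans. This is a minimal sketch, not the library's own method. `MatchCollector`, `EXTRACTORS`, and `extract_groups` are hypothetical names, the callback signature is the one used in the quickstart, and the hits are simulated so the block runs without Hyperscan. It assumes the patterns were compiled with `HS_FLAG_SOM_LEFTMOST` so that start offsets are meaningful, and that each pattern is also valid in Python's `re` syntax.

```python
import re


class MatchCollector:
    """Callable with the handler signature used in the quickstart."""

    def __init__(self):
        self.hits = []

    def __call__(self, id, from_, to, flags, context):
        self.hits.append((id, from_, to))
        return 0  # keep scanning


# Hypothetical table: pattern ID -> Python regex with the capture groups wanted.
EXTRACTORS = {
    1: re.compile(rb"user=(\w+)"),
    2: re.compile(rb"port=(\d+)"),
}


def extract_groups(data, hits):
    # A pattern such as \w+ can be reported once per possible end offset,
    # so keep only the longest end for each (pattern ID, start) pair.
    longest = {}
    for pattern_id, start, end in hits:
        key = (pattern_id, start)
        longest[key] = max(end, longest.get(key, end))
    results = []
    for (pattern_id, start), end in longest.items():
        # Stage two: rerun a capturing regex on the matched span only.
        m = EXTRACTORS[pattern_id].fullmatch(data, start, end)
        if m is not None:
            results.append((pattern_id, m.groups()))
    return results


data = b"login user=alice port=8080"
collector = MatchCollector()
# Stand-in for db.scan(data, match_event_handler=collector): these calls
# imitate what the callback would receive for the two patterns.
collector(1, 6, 15, 0, None)   # shorter report of pattern 1: b"user=alic"
collector(1, 6, 16, 0, None)   # b"user=alice"
collector(2, 17, 26, 0, None)  # b"port=8080"

groups = extract_groups(data, collector.hits)
print(groups)
```

Keeping the hits on the collector object avoids a module-level list and gives each scan its own result set.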
Install
-
pip install hyperscan
Imports
- hyperscan
import hyperscan
- Database
from hyperscan import Database, HS_FLAG_CASELESS, HS_FLAG_SOM_LEFTMOST
Quickstart
import hyperscan

# scan() does not return the matches; the callback records them here.
matches = []

def on_match(id: int, from_: int, to: int, flags: int, context: object | None) -> int:
    print(f"Match for pattern ID {id} at [{from_}:{to}] with flags {flags}")
    matches.append((id, from_, to))
    return 0  # Continue scanning

# Define patterns with IDs and flags
patterns_config = [
    (b'foobar', 101, 0),  # Case-sensitive literal; does not match 'FoObAr' in the data below
    (b'baz', 102, hyperscan.HS_FLAG_CASELESS),  # Case-insensitive
    # Case-insensitive, reporting the start of match. HS_FLAG_SINGLEMATCH is omitted
    # because Hyperscan does not support combining it with HS_FLAG_SOM_LEFTMOST.
    (b'qux', 103, hyperscan.HS_FLAG_CASELESS | hyperscan.HS_FLAG_SOM_LEFTMOST),
]
expressions, ids, flags = zip(*patterns_config)

db = hyperscan.Database()
db.compile(
    expressions=expressions,
    ids=ids,
    elements=len(patterns_config),
    flags=flags
)

# Create a scratch space for scanning
scratch = hyperscan.Scratch(db)

# Scan a data buffer in block mode
data = b'This is a FoObAr string with baz and QuX inside.'
print(f"Scanning data: '{data.decode()}'")
db.scan(data=data, scratch=scratch, match_event_handler=on_match)
if not matches:
    print("No matches found.")

# Example of streaming mode
print("\n--- Streaming Mode ---")
db_streaming = hyperscan.Database(mode=hyperscan.HS_MODE_STREAM)
db_streaming.compile(
    expressions=[b'stream_test'],
    ids=[201],
    elements=1,
    flags=[0]
)
scratch_streaming = hyperscan.Scratch(db_streaming)
with db_streaming.stream(scratch=scratch_streaming, match_event_handler=on_match) as stream:
    stream.scan(data=b'first part of stream_test data')
    stream.scan(data=b'second part of stream_test data')
    # Matches may be reported during stream.scan() or when the stream closes
    # on leaving the with block.
print("Streaming scan complete.")
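In practice a stream is usually fed from a file or socket rather than from literals. The sketch below shows one way to do that; `feed_in_chunks` is a hypothetical helper, and `list.append` stands in for `stream.scan` so the block runs without Hyperscan installed. The chunk size of 16 is chosen so that `stream_test` straddles a chunk boundary, which is the case streaming mode is designed to handle.

```python
import io


def feed_in_chunks(scan, fileobj, chunk_size=65536):
    """Read fileobj in fixed-size chunks, pass each to scan(), return bytes fed."""
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:  # empty bytes means end of input
            break
        scan(chunk)
        total += len(chunk)
    return total


# Assumption: with Hyperscan this would be used as
#     with db_streaming.stream(match_event_handler=on_match) as stream:
#         feed_in_chunks(stream.scan, open(path, "rb"))
chunks = []
source = io.BytesIO(b"first part of stream_test data")
fed = feed_in_chunks(chunks.append, source, chunk_size=16)

print(chunks)
print(fed)
```

Passing the scan callable as a parameter keeps the helper independent of the stream object, so it can be exercised without a compiled database.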