Aho-Corasick Rust Bindings for Python
ahocorasick-rs is a Python library for fast multi-pattern string searching. It wraps the Rust `aho-corasick` library and is intended as a significantly faster alternative to pure-Python implementations and to the C-backed `pyahocorasick` when searching for many substrings at once. The library is actively maintained; the latest version is 1.0.3, and releases are made as needed for performance improvements or new Python version support.
Common errors
- AttributeError: module 'ahocorasick_rs' has no attribute 'Automaton'
  cause Users migrating from `pyahocorasick` often assume that `ahocorasick_rs` also names its class `Automaton`.
  fix The main class for string matching in `ahocorasick_rs` is `AhoCorasick` (or `BytesAhoCorasick` for byte strings). Replace `ahocorasick_rs.Automaton` with `ahocorasick_rs.AhoCorasick`.
- TypeError: patterns must be an iterable of strings or bytes
  cause The constructor expects an iterable (such as a list or tuple) of patterns, and all patterns must be of the same type (all strings or all bytes).
  fix Pass the `patterns` argument as an iterable (e.g., `['pattern1', 'pattern2']`) whose elements are consistent: all `str` for `AhoCorasick`, or all `bytes` for `BytesAhoCorasick`.
- ModuleNotFoundError: No module named 'ahocorasick_rs'
  cause The `ahocorasick-rs` package is not installed in the current Python environment, or the intended environment is not activated.
  fix Install the library using `pip install ahocorasick-rs`. If using a virtual environment, ensure it's activated.
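The first two errors above both come down to choosing the right class for the pattern type. The sketch below shows one way to check that before building an automaton. `matcher_class_name` and `build_matcher` are hypothetical helpers, not part of `ahocorasick_rs`, and rejecting a bare string is this sketch's own policy rather than documented library behavior.

```python
def matcher_class_name(patterns):
    """Return the name of the ahocorasick_rs class that fits `patterns`.

    Hypothetical helper: it only inspects types and never imports the library.
    """
    # A bare str is itself an iterable of 1-character strings (and bytes an
    # iterable of ints), so it can pass for a pattern list by accident.
    if isinstance(patterns, (str, bytes)):
        raise TypeError("pass a list or tuple of patterns, not a single value")
    patterns = list(patterns)
    if all(isinstance(p, str) for p in patterns):
        return "AhoCorasick"
    if all(isinstance(p, bytes) for p in patterns):
        return "BytesAhoCorasick"
    raise TypeError("patterns must be all str or all bytes, not a mix")


def build_matcher(patterns):
    """Build the matching automaton; requires ahocorasick-rs to be installed."""
    import ahocorasick_rs  # deferred so the type check above works on its own

    patterns = list(patterns)
    cls = getattr(ahocorasick_rs, matcher_class_name(patterns))
    return cls(patterns)
```

Calling `build_matcher(["hello", "world"])` would return an `AhoCorasick` instance and `build_matcher([b"foo", b"bar"])` a `BytesAhoCorasick` instance, while a mixed list fails early with a message that names the problem.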
Warnings
- breaking Prior to version 1.0.0, the API of ahocorasick-rs was not guaranteed to be stable and may have included breaking changes in minor or patch releases. Users on older versions should consult specific release notes for migration paths. Version 1.0.0 introduced API stability.
- gotcha For very small haystacks or only a handful of patterns (e.g., 1-3), the cost of constructing the Aho-Corasick automaton can outweigh the search-time savings, so plain `str.find()` calls or a regular expression may be slightly faster. The benefits of Aho-Corasick grow with the number of patterns and the size of the haystack.
- gotcha The underlying Aho-Corasick implementation supports several match kinds (standard, leftmost-first, leftmost-longest) that determine which match is reported when patterns overlap. Using the wrong one can lead to unexpected results. The default, `MatchKind.Standard` (`MATCHKIND_STANDARD` in older releases), reports matches in the order the automaton detects them; overlapping matches are returned only when a search is called with `overlapping=True`.
- gotcha Building a Deterministic Finite Automaton (DFA) gives the fastest searches but can be slow to construct and memory-intensive, especially with a very large number of patterns. By default the library picks an implementation heuristically; you can override it with the `implementation` argument and the `Implementation` enum (`NoncontiguousNFA`, `ContiguousNFA`, or `DFA`), trading off build time, memory usage, and search speed.
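The match-kind and implementation warnings are easier to judge with a concrete case. The sketch below assumes the `matchkind`, `implementation`, and `overlapping` keyword arguments and the `MatchKind` and `Implementation` enums as found in recent releases; the function names are made up for illustration. Imports are deferred into the functions so that loading the snippet does not itself require the package.

```python
def compare_match_kinds(patterns, haystack):
    """Search `haystack` once per match kind and collect the matched strings."""
    # Deferred import: calling this requires ahocorasick-rs to be installed.
    from ahocorasick_rs import AhoCorasick, MatchKind

    kinds = {
        "standard": MatchKind.Standard,
        "leftmost_first": MatchKind.LeftmostFirst,
        "leftmost_longest": MatchKind.LeftmostLongest,
    }
    return {
        name: AhoCorasick(patterns, matchkind=kind).find_matches_as_strings(haystack)
        for name, kind in kinds.items()
    }


def overlapping_matches(patterns, haystack):
    """Report every match, including ones that overlap each other."""
    from ahocorasick_rs import AhoCorasick

    ac = AhoCorasick(patterns)  # MatchKind.Standard is the default
    return ac.find_matches_as_strings(haystack, overlapping=True)


def build_dfa(patterns):
    """Request the DFA implementation explicitly instead of the heuristic."""
    from ahocorasick_rs import AhoCorasick, Implementation

    return AhoCorasick(patterns, implementation=Implementation.DFA)


if __name__ == "__main__":
    patterns = ["disco", "disc", "discontent"]
    print(compare_match_kinds(patterns, "discontent"))
    print(overlapping_matches(patterns, "discontent"))
    print(build_dfa(patterns).find_matches_as_strings("discontent"))
```

With these patterns, the standard kind should report `disc` (the match that completes first), leftmost-first should report `disco` (the earliest-listed pattern that matches at the leftmost position), and leftmost-longest should report `discontent`. The overlapping search should return all three.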
Install
- pip install ahocorasick-rs
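The distribution is installed as `ahocorasick-rs` (hyphen) but imported as `ahocorasick_rs` (underscore), which makes it easy to misread whether an install succeeded. The following sketch uses only the standard library to check the active interpreter; `installed_version` is a hypothetical helper name.

```python
import importlib.metadata
import importlib.util
import sys


def installed_version(dist_name="ahocorasick-rs", module_name="ahocorasick_rs"):
    """Return the installed version string, or None if the package is absent."""
    # find_spec locates the module without importing it.
    if importlib.util.find_spec(module_name) is None:
        return None
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Printing the interpreter path helps spot a virtual environment
    # that was not activated before running pip.
    print(f"interpreter: {sys.executable}")
    print(f"ahocorasick-rs: {installed_version() or 'not installed'}")
```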
Imports
- AhoCorasick
from ahocorasick_rs import AhoCorasick
- BytesAhoCorasick
from ahocorasick_rs import BytesAhoCorasick
Quickstart
import ahocorasick_rs
patterns = ["hello", "world", "fish"]
haystack = "this is my first hello world. hello!"
# Create an AhoCorasick automaton
ac = ahocorasick_rs.AhoCorasick(patterns)
# Find matches and their indexes (pattern_index, start_index, end_index)
matches_by_index = ac.find_matches_as_indexes(haystack)
print(f"Matches by index: {matches_by_index}")
# Expected: [(0, 17, 22), (1, 23, 28), (0, 30, 35)]
# Find matches and return the actual strings
matches_as_strings = ac.find_matches_as_strings(haystack)
print(f"Matches as strings: {matches_as_strings}")
# Expected: ['hello', 'world', 'hello']
# For byte strings
byte_patterns = [b"foo", b"bar"]
byte_haystack = b"this is foo and bar"
byte_ac = ahocorasick_rs.BytesAhoCorasick(byte_patterns)
byte_matches = byte_ac.find_matches_as_indexes(byte_haystack)
print(f"Byte matches: {byte_matches}")
# Expected: [(0, 8, 11), (1, 16, 19)]
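Index tuples are compact but not self-describing. The sketch below turns them into readable records, working from the tuple layout shown in the Quickstart comments and treating `end_index` as exclusive, which is what the expected output implies. `resolve_matches` is a hypothetical helper, and it works on `str` and `bytes` haystacks alike because it only indexes and slices.

```python
def resolve_matches(patterns, haystack, matches):
    """Turn (pattern_index, start, end) tuples into readable records."""
    resolved = []
    for pattern_index, start, end in matches:
        resolved.append(
            {
                "pattern": patterns[pattern_index],
                "span": (start, end),
                "text": haystack[start:end],  # end is exclusive, as in slicing
            }
        )
    return resolved


if __name__ == "__main__":
    patterns = ["hello", "world", "fish"]
    haystack = "this is my first hello world. hello!"
    # Index tuples copied from the expected Quickstart output.
    matches = [(0, 17, 22), (1, 23, 28), (0, 30, 35)]
    for record in resolve_matches(patterns, haystack, matches):
        print(record)
```

In real use, `matches` would be the return value of `find_matches_as_indexes(haystack)`. Keeping the span alongside the text makes it easy to highlight or replace matches later.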