pyahocorasick
pyahocorasick is a fast and memory-efficient Python library for exact or approximate multi-pattern string search. It finds all occurrences of many key strings in an input text in a single pass using the Aho-Corasick algorithm. It is implemented in C as a CPython extension, supports Python 3.10+, and runs on Linux, macOS, and Windows. The project is actively developed with regular releases; the current version is 2.3.0.
Warnings
- breaking Python 3.9 support was dropped in v2.3.0. Python 3.8 support was dropped in v2.2.0, and Python 3.6/3.7 in v2.1.0. Users on older Python versions must use an earlier `pyahocorasick` release.
- breaking The internal trie representation changed in v1.4.0, breaking compatibility with pickle and `save()` formats from previous versions. Automata pickled or saved with older versions cannot be loaded by v1.4.0 or newer.
- gotcha The correct Python module to import is `ahocorasick`, not `pyahocorasick`. Attempting to import `pyahocorasick` will result in an `ImportError`.
- gotcha After adding all words to the `Automaton` (which initially acts as a trie), you must call `make_automaton()` to finalize it into an Aho-Corasick automaton before searching with `iter()` or related methods. Until then the `Automaton` stays in the `TRIE` kind rather than `AHOCORASICK`, and search calls raise an error (an `AttributeError` in current releases) instead of returning matches.
- gotcha Installing `pyahocorasick` from source requires a C compiler to build the CPython extension. While pre-built wheels are generally available on PyPI, source installation without a compiler will fail.
Install
pip install pyahocorasick
Imports
- Automaton
from ahocorasick import Automaton
Quickstart
import ahocorasick
# Create an Automaton object
A = ahocorasick.Automaton()
# Add keywords and associated values (optional)
words_to_find = {
    "apple": "fruit",
    "apply": "verb",
    "banana": "fruit",
    "band": "music",
}
for idx, (word, value) in enumerate(words_to_find.items()):
    # Store the keyword itself in the payload so matches can report it
    A.add_word(word, (idx, word, value))
# Finalize the automaton for efficient searching
A.make_automaton()
# Search in a haystack string
haystack = "I like to eat an apple and apply for a new band that plays banana songs."
print("Found matches:")
for end_index, (insertion_order, word, value) in A.iter(haystack):
    # iter() yields the index of the last matched character; derive the start from the keyword length
    start_index = end_index - len(word) + 1
    print(f"    Found '{word}' ({value}) at ({start_index}, {end_index})")
# Example of retrieving a value (trie-like behavior)
print(f"\nValue for 'apple': {A.get('apple')}")
# Check existence
print(f"Is 'apple' in the automaton? {'apple' in A}")
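
For readers curious what finalizing the trie involves conceptually, here is a pure-Python sketch of the textbook Aho-Corasick construction (a trie plus failure links built breadth-first). It is not the library's C implementation, and the names `build_automaton` and `search` are hypothetical helpers chosen for this illustration; it only shows why every keyword can be found in one pass over the text.

```python
from collections import deque


def build_automaton(words):
    """Build goto/fail/output tables for the given keywords."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix that is also a trie prefix
    out = [[]]    # out[state] -> keywords ending at this state

    # Phase 1: insert every word into a trie.
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first pass that sets the failure links.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def search(text, tables):
    """Yield (start_index, end_index, word) for every keyword occurrence."""
    goto, fail, out = tables
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            yield i - len(word) + 1, i, word


tables = build_automaton(["he", "she", "his", "hers"])
print(list(search("ushers", tables)))
# [(1, 3, 'she'), (2, 3, 'he'), (2, 5, 'hers')]
```

As with `Automaton.iter()`, the end index here is the position of the last matched character, so the start is recovered by subtracting the keyword length and adding one.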