pyahocorasick

2.3.0 · active · verified Thu Apr 09

pyahocorasick is a fast, memory-efficient Python library for exact or approximate multi-pattern string search. It finds all occurrences of many key strings at once, in a single pass over the input text, using the Aho-Corasick algorithm. It is implemented in C for performance, supports Python 3.10+, and runs on Linux, macOS, and Windows. The library is actively developed; the current version is 2.3.0.

Warnings

Install

Imports

Quickstart

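Create an `Automaton`, add each keyword with `add_word()` (with the default storage mode, every keyword needs an associated value), then call `make_automaton()` to convert the trie into a searchable Aho-Corasick automaton. The `iter()` method then yields an `(end_index, value)` pair for every match in the input string, where `end_index` is the position of the match's last character. Because only the end index is reported, store the keyword (or its length) in the value if you need to recover the start position.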

import ahocorasick

# Create an Automaton object
A = ahocorasick.Automaton()

# Keywords and the values to associate with them
words_to_find = {
    "apple": "fruit",
    "apply": "verb",
    "banana": "fruit",
    "band": "music"
}

for idx, (word, value) in enumerate(words_to_find.items()):
    # Store the keyword itself in the payload: iter() reports only the end
    # index, so the keyword's length is needed to compute the start index.
    A.add_word(word, (idx, word, value))

# Finalize the automaton for efficient searching
A.make_automaton()

# Search in a haystack string
haystack = "I like to eat an apple and apply for a new band that plays banana songs."

print("Found matches:")
for end_index, (insertion_order, word, value) in A.iter(haystack):
    # end_index is inclusive, hence the + 1
    start_index = end_index - len(word) + 1
    print(f"  Found '{word}' ({value}) at ({start_index}, {end_index})")

# Example of retrieving a value (trie-like behavior)
print(f"\nValue for 'apple': {A.get('apple')}")

# Check existence
print(f"Is 'apple' in the automaton? {'apple' in A}")
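To make the `(end_index, value)` convention more concrete, the following is a minimal pure-Python sketch of the Aho-Corasick technique. It is an illustration only, not pyahocorasick's C implementation, and the names `build_automaton` and `iter_matches` are hypothetical helpers that are not part of the library's API. It assumes non-empty keywords given as a plain `dict`.

```python
from collections import deque


def build_automaton(words):
    """Build goto/fail/output tables from a {keyword: value} mapping."""
    goto = [{}]    # goto[state][char] -> next state (trie edges)
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    output = [[]]  # output[state] -> (keyword, value) pairs ending here

    # Phase 1: insert every keyword into the trie.
    for word, value in words.items():
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append((word, value))

    # Phase 2: breadth-first pass to compute the failure links.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # Inherit keywords that end at the fallback state (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def iter_matches(text, automaton):
    """Yield (end_index, (keyword, value)) for every match, overlaps included."""
    goto, fail, output = automaton
    state = 0
    for index, ch in enumerate(text):
        # Follow failure links until some state accepts this character.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word, value in output[state]:
            yield index, (word, value)


automaton = build_automaton({"he": 1, "she": 2, "his": 3, "hers": 4})
for end_index, (word, value) in iter_matches("ushers", automaton):
    start_index = end_index - len(word) + 1
    print(word, value, (start_index, end_index))
# she 2 (1, 3)
# he 1 (2, 3)
# hers 4 (2, 5)
```

The text is scanned once, and overlapping keywords ("she" and "he" both end at index 3) are reported together. As in the library example, the start position has to be derived from the keyword length because only the end index is tracked during the scan.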
