TextSearch
TextSearch is a Python library for searching and replacing many strings in text at once. Matching runs at C speed through an Aho-Corasick implementation, which makes it considerably faster than equivalent regex operations when many search terms are involved. The library is aimed at Natural Language Processing (NLP) and text search tasks, and by default it matches full words rather than substrings. The current version is 0.0.24; releases are published as needed.
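To make the Aho-Corasick idea concrete, here is a minimal pure-Python sketch of the algorithm: a trie of the search terms plus failure links, scanned once over the text. It is an illustration only, not the C implementation TextSearch relies on, and the names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque

def build_automaton(patterns):
    """Build a trie with failure links for a list of non-empty patterns."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root)
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the suffix node
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in one pass."""
    goto, fail, out = automaton
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits

automaton = build_automaton(["he", "she", "his", "hers"])
print(find_all("ushers", automaton))
# prints: [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The text is scanned once regardless of how many terms are loaded, which is why this approach scales better than trying each term separately. Note that this sketch reports substring matches; it does not apply the full-word filtering that TextSearch adds on top.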
Warnings
- Gotcha: By default, TextSearch matches full words (tokens). Users accustomed to standard regex may expect substring matches. The behavior is configurable, but it is a key difference from typical regex patterns.
- Gotcha: The library relies on a C module (`pyahocorasick`) for its performance. On some systems, installation may require a C compiler and development headers, and the build can fail if they are missing.
- Gotcha: TextSearch is significantly faster than regex when searching for *multiple* strings. For a *single* simple string, Python's built-in `str.find()` or the `in` operator may be sufficient and carries less overhead.
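The difference between substring and full-word matching can be seen with the standard library alone. The sketch below uses `re` as a stand-in and does not call TextSearch; `substring_hits` and `whole_word_hits` are hypothetical helper names, and the word-boundary rule shown (no adjacent `\w` character) is an assumption that may differ from TextSearch's own boundary settings.

```python
import re

def substring_hits(text, term):
    """Start index of every occurrence, including those inside longer words."""
    return [m.start() for m in re.finditer(re.escape(term), text, re.IGNORECASE)]

def whole_word_hits(text, term):
    """Start index of occurrences not touching another word character."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return [m.start() for m in re.finditer(pattern, text, re.IGNORECASE)]

text = "hi there, this is a hint"
print(substring_hits(text, "hi"))   # prints: [0, 11, 20]
print(whole_word_hits(text, "hi"))  # prints: [0]
```

The substring search also fires inside "this" and "hint", which is the behavior regex users tend to expect and the behavior a full-word default avoids.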
Install
pip install textsearch
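To confirm that the install succeeded and that the C dependency was built, a small standard-library check can help. `installed_version` is a hypothetical helper written for this example, not part of TextSearch.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

# Report both the library and the C module it depends on
for name in ("textsearch", "pyahocorasick"):
    print(name, installed_version(name) or "not installed")
```

If `pyahocorasick` shows as not installed, the build step likely failed, which usually points to a missing compiler or missing development headers.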
Imports
- TextSearch
from textsearch import TextSearch
Quickstart
from textsearch import TextSearch

# Ignore case when matching; findall() returns the matched text as it appears
ts = TextSearch(case="ignore", returns="match")
words_to_find = ["hi", "bye", "hello"]
ts.add(words_to_find)  # add() accepts a list of terms

text = "Hello, hi Pascal, bye, how are you?"
found_matches = ts.findall(text)
print(f"Original text: {text}")
print(f"Words added for search: {words_to_find}")
print(f"Found matches: {found_matches}")

# Example of replacement: add(term, substitute) maps each term to its replacement
ts_replace = TextSearch(case="ignore", returns="norm")
ts_replace.add("hi", "GREETING")
ts_replace.add("bye", "FAREWELL")
replaced_text = ts_replace.replace(text)
print(f"Text after replacement: {replaced_text}")
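For comparison, the same replacement task can be written as a regex baseline using only the standard library. This is a sketch under stated assumptions (whole-word, case-insensitive matching, with `\w` defining word characters); `replace_whole_words` is a hypothetical helper, and TextSearch's actual output may differ in details such as boundary handling.

```python
import re

def replace_whole_words(text, mapping):
    """Replace whole-word, case-insensitive occurrences of each key in `mapping`."""
    if not mapping:
        return text
    lookup = {term.lower(): repl for term, repl in mapping.items()}
    # Longest terms first so overlapping alternatives prefer the longer match
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in terms) + r")(?!\w)",
        re.IGNORECASE,
    )
    # A function replacement avoids backslash processing in the substitutes
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

text = "Hello, hi Pascal, bye, how are you?"
print(replace_whole_words(text, {"hi": "GREETING", "bye": "FAREWELL"}))
# prints: Hello, GREETING Pascal, FAREWELL, how are you?
```

With a handful of terms this baseline is adequate. The alternation grows with every added term, which is the case where an Aho-Corasick based search is expected to pay off.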