Flashtext Keyword Processor
Flashtext is a Python library for efficient keyword extraction and replacement in text. It uses a trie-based algorithm inspired by Aho-Corasick, so matching time scales with the length of the text rather than with the number of keywords. This gives significant performance gains over regular expressions, especially for large keyword dictionaries. The current stable version is 2.7, released in 2018; the library is largely in maintenance mode but still widely used.
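To make the trie idea concrete, here is a minimal pure-Python sketch of whole-word keyword matching over a character trie. It is not Flashtext's actual implementation: the names `build_trie`, `extract`, `WORD_CHARS`, and the `"_end_"` marker are hypothetical, and the sketch omits Aho-Corasick failure links and assumes ASCII-like text.

```python
import string

# Characters treated as part of a word; anything else is a word boundary.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(keywords):
    """Build a character trie from {keyword: clean_name}.

    The multi-character key "_end_" marks a complete keyword and
    stores the clean name to report for it.
    """
    trie = {}
    for keyword, clean_name in keywords.items():
        node = trie
        for ch in keyword.lower():
            node = node.setdefault(ch, {})
        node["_end_"] = clean_name
    return trie


def extract(text, trie):
    """Return clean names of keywords found as whole words, longest match first."""
    lowered = text.lower()
    n = len(lowered)
    found = []
    i = 0
    while i < n:
        # Only start a match at the beginning of a word.
        if i > 0 and lowered[i - 1] in WORD_CHARS:
            i += 1
            continue
        node, j, best = trie, i, None
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            # Accept a keyword only if it ends at a word boundary.
            if "_end_" in node and (j == n or lowered[j] not in WORD_CHARS):
                best = (node["_end_"], j)
        if best:
            found.append(best[0])
            i = best[1]
        else:
            i += 1
    return found


trie = build_trie({
    "Big Apple": "New York",
    "Bay Area": "Bay Area",
    "New Delhi": "NCR region",
})
result = extract("I love Big Apple and Bay Area. New Delhi is also great.", trie)
print(result)
```

Each character of the text is walked through the trie at most a bounded number of times, which is why adding more keywords grows the trie but not the scan over the text.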
Warnings
- gotcha By default Flashtext treats only `[A-Za-z0-9_]` as word characters (its `non_word_boundaries` set) and every other character as a word boundary. This default may not suit all languages (e.g., Chinese, Japanese) or custom requirements, and keywords can be missed or matched incorrectly when they are not delimited by characters outside this set. Users can customize the set with `add_non_word_boundary()` or `set_non_word_boundaries()`.
- gotcha Flashtext generally outperforms regular expressions for keyword extraction and replacement when the number of keywords is large (typically >500). For a small number of keywords, regular expressions may be equally or more efficient, and when complex patterns are required (such as partial matches or special character handling) they may be the only option.
- gotcha If `add_keyword()` is used with a tuple as the `clean_name` (e.g., `add_keyword('Taj Mahal', ('Monument', 'Taj Mahal'))`), the `replace_keywords()` method will not function as expected because it anticipates a string replacement, not a tuple.
- deprecated A separate, community-driven package `flashtext2` (and `flashtextr`) exists, which is a rewrite in Rust, offering significant performance improvements (3-10x faster) and better Unicode handling. While not an official successor from the original author, it addresses some limitations of `flashtext`.
Install
pip install flashtext
Imports
- KeywordProcessor
from flashtext import KeywordProcessor
Quickstart
from flashtext import KeywordProcessor
# Initialize the keyword processor (case_sensitive=False by default)
keyword_processor = KeywordProcessor()
# Add keywords. Can map multiple 'unclean' names to one 'clean' name.
keyword_processor.add_keyword('Big Apple', 'New York')
keyword_processor.add_keyword('Bay Area')
keyword_processor.add_keyword('New Delhi', 'NCR region')
# Extract keywords
text_to_extract = 'I love Big Apple and Bay Area. New Delhi is also great.'
keywords_found = keyword_processor.extract_keywords(text_to_extract)
print(f"Extracted keywords: {keywords_found}") # Expected: ['New York', 'Bay Area', 'NCR region']
# Replace keywords
text_to_replace = 'I love Big Apple and new delhi.'
new_sentence = keyword_processor.replace_keywords(text_to_replace)
print(f"Replaced sentence: {new_sentence}") # Expected: 'I love New York and NCR region.'
# Extract with span information
keywords_with_span = keyword_processor.extract_keywords('I love Big Apple.', span_info=True)
print(f"Keywords with span: {keywords_with_span}") # Expected: [('New York', 7, 16)]