Confusables
Confusables is a Python package for analyzing and matching words that look alike but are written with different Unicode characters. It uses the official Unicode confusable characters list to detect homoglyphs, which is useful for identifying malicious look-alike website names, normalizing text data, or catching attempts to bypass profanity filters. The library is currently at version 1.2.0 and is updated as needed, mainly to track changes to the Unicode character set.
Warnings
- breaking The `match_subword` option was removed from the `confusable_regex()` function in version 1.0.0. It now behaves as if `match_subword` is always true.
- breaking Version 1.0.0 updated the bundled data to Unicode Confusables version 12.1.0, and every Unicode character now matches itself. As a result, the set of characters considered confusable may differ from older versions.
- gotcha The definition of 'confusable' is intentionally loose and may become more or less strict in future versions, as it deals with human interpretation. This could subtly alter matching behavior between releases.
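The subword behavior described in the first warning can be illustrated with the standard library alone. The sketch below does not call `confusables`: `LOOKALIKES` is a small hand-written, hypothetical table standing in for the Unicode confusables data, and `build_pattern` is an illustrative helper, not the library's implementation. It shows an unanchored pattern matching inside a longer word, and one way a caller might restore whole-word matching by adding lookarounds.

```python
import re

# Hypothetical, hand-picked look-alikes; the real data set is far larger.
LOOKALIKES = {
    "a": "a\u0430\u03b1@",  # Latin a, Cyrillic a, Greek alpha, at sign
    "d": "d\u0501",         # Latin d, Cyrillic komi de
    "m": "m\u043c",         # Latin m, Cyrillic em
    "i": "i\u04561",        # Latin i, Cyrillic i, digit one
    "n": "n\u0578",         # Latin n, Armenian vo
}


def build_pattern(word: str) -> str:
    """Return a regex string with one character class per character of `word`."""
    return "".join(
        "[" + re.escape(LOOKALIKES.get(ch, ch)) + "]" for ch in word
    )


# Unanchored: matches anywhere, including inside a longer word.
subword = re.compile(build_pattern("admin"))
# Whole-word variant: reject matches that touch other word characters.
whole_word = re.compile(r"(?<!\w)" + build_pattern("admin") + r"(?!\w)")

embedded = "login as super\u0430dm\u0456n123"  # Cyrillic a and i inside a word
standalone = "the \u0430dm\u0456n page"        # same spelling as its own word

hit_inside = bool(subword.search(embedded))
hit_inside_whole = bool(whole_word.search(embedded))
hit_standalone_whole = bool(whole_word.search(standalone))

print(hit_inside)            # True
print(hit_inside_whole)      # False
print(hit_standalone_whole)  # True
```

Whether the pattern returned by `confusable_regex()` composes cleanly with such lookarounds is an assumption that should be tested against the installed version.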
Install
pip install confusables
Imports
- is_confusable
from confusables import is_confusable
- confusable_characters
from confusables import confusable_characters
- confusable_regex
from confusables import confusable_regex
- normalize
from confusables import normalize
Quickstart
from confusables import is_confusable, confusable_regex, normalize
# Check if two strings are confusable
print(f"'rover' vs 'ƦỏV3ℛ': {is_confusable('rover', 'ƦỏV3ℛ')}")
# Build a regex string that matches 'admin' and its look-alike spellings
regex_pattern = confusable_regex('admin', include_character_padding=True)
print(f"Regex for 'admin': {regex_pattern}")
# Normalize a look-alike string to its plain candidate forms (the input contains a Cyrillic 'о', U+043E, in place of a Latin 'o')
normalized_forms = normalize('micrоsoft', prioritize_alpha=True)
print(f"Normalized forms of 'micrоsoft': {normalized_forms}")
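Because `normalize` appears to return several candidate forms (the Quickstart names the result `normalized_forms`), a caller usually has to decide what to do with the whole list. The following is a minimal sketch of one option, assuming the result is a list of strings: flag the input if any candidate equals a protected name. `PROTECTED_NAMES` and `impersonates` are hypothetical names introduced for illustration, and the sample list is hand-written, not actual library output.

```python
from typing import Iterable

# Hypothetical set of names to protect; not part of the confusables package.
PROTECTED_NAMES = {"microsoft", "admin"}


def impersonates(original: str, candidates: Iterable[str]) -> bool:
    """Return True if `original` is not itself a protected name
    but at least one of its normalized candidates is."""
    if original.lower() in PROTECTED_NAMES:
        return False  # the genuine spelling is not an impersonation
    return any(c.lower() in PROTECTED_NAMES for c in candidates)


# Hand-written stand-in for a normalize() result; real output may differ.
spoofed = "micr\u043esoft"         # \u043e is a Cyrillic look-alike of 'o'
sample_candidates = ["microsoft"]

print(impersonates(spoofed, sample_candidates))      # True
print(impersonates("microsoft", ["microsoft"]))      # False
```

With the package installed, `sample_candidates` would instead come from `normalize(spoofed, prioritize_alpha=True)`.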