Confusable Homoglyphs
Confusable Homoglyphs (version 3.3.1) is a Python library for detecting homograph attacks, in which visually similar Unicode characters (homoglyphs) stand in for the expected ones. It checks whether a string mixes scripts and whether it contains characters that could be confused with characters from a preferred set of scripts (passed as `preferred_aliases`). The library is actively maintained, with regular updates to its bundled Unicode data and to its Python compatibility.
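To make the idea of a mixed-script string concrete, here is a minimal standard-library sketch. It is not the library's implementation: it approximates a character's script by the first word of its Unicode name, which is a simplification, and the helper names `script_guess`, `scripts_in` and `looks_mixed_script` are hypothetical.

```python
import unicodedata


def script_guess(char):
    """Approximate a character's script from the first word of its Unicode name.

    This is a rough heuristic for illustration only; the library relies on
    proper Unicode script data instead.
    """
    if not char.isalpha():
        # Digits, punctuation and spaces are treated as script-neutral.
        return "COMMON"
    name = unicodedata.name(char, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def scripts_in(text):
    """Return the set of guessed scripts used by the letters in text."""
    return {script_guess(c) for c in text} - {"COMMON"}


def looks_mixed_script(text):
    """True when the letters in text appear to come from more than one script."""
    return len(scripts_in(text)) > 1


# U+0391 is GREEK CAPITAL LETTER ALPHA, which looks like a Latin "A".
print(scripts_in("\u0391laskaJazz"))
print(looks_mixed_script("\u0391laskaJazz"))
print(looks_mixed_script("AlaskaJazz"))
```

A mixed-script result alone does not prove an attack, since legitimate text can combine scripts. That is why the library also checks whether the foreign characters are actually confusable with the preferred ones.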
Warnings
- breaking Version 3.3.0 dropped support for Python 2 and Python versions older than 3.7. Ensure your environment uses Python 3.7 or newer for compatibility.
- breaking The `confusables.is_dangerous()` function's return value changed in version 3.3.0. It now strictly returns a boolean (True/False) as documented, whereas previously it might have returned `False` or a list of confusable characters.
- gotcha The underlying Unicode data files (`categories.json` and `confusables.json`) are now distributed with the package. They are recreated automatically if deleted, and users can optionally refresh them with a CLI tool whose dependencies are installed with `pip install confusable_homoglyphs[cli]`. Running the update brings the data in line with the latest Unicode Consortium release.
- gotcha Instantiating `Categories` or `Confusables` objects can be expensive, because the large Unicode data files have to be loaded and parsed. Create these objects once and reuse them across calls.
- gotcha There are other Python libraries, such as `homoglyphs` (sometimes referred to as `homoglyphs_fork`), that offer similar but distinct functionality. Ensure you are importing from `confusable_homoglyphs` to use this specific library's API and features.
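The two breaking changes above can be guarded against in code that has to run against more than one release of the library. The following is a minimal sketch using only the standard library; `python_is_supported` and `normalize_is_dangerous` are hypothetical helper names, not part of the package.

```python
import sys


def python_is_supported(version_info=sys.version_info):
    """True when the interpreter meets the 3.7 minimum stated for version 3.3.0."""
    return tuple(version_info[:2]) >= (3, 7)


def normalize_is_dangerous(result):
    """Collapse an is_dangerous() result into a strict bool.

    Older releases could return False or a list of confusable characters;
    3.3.0 and later return a bool. bool() handles both: an empty or False
    result becomes False, a non-empty list becomes True.
    """
    return bool(result)


if not python_is_supported():
    raise RuntimeError("confusable_homoglyphs 3.3.0+ needs Python 3.7 or newer")

# Hand-written stand-ins for the two return styles, not real library output.
print(normalize_is_dangerous(False))
print(normalize_is_dangerous([{"character": "x"}]))
print(normalize_is_dangerous(True))
```

Wrapping the call this way keeps calling code identical on either side of the 3.3.0 boundary.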
Install
pip install confusable-homoglyphs
Imports
- confusables
from confusable_homoglyphs import confusables
Quickstart
from confusable_homoglyphs import confusables
# is_dangerous() is True when a string both mixes scripts and contains confusable characters
text_dangerous = "ΑlaskaJazz" # First char is Greek Alpha
text_safe = "AlaskaJazz" # All Latin
is_dangerous_result = confusables.is_dangerous(text_dangerous, preferred_aliases=['latin'])
print(f"Is '{text_dangerous}' dangerous? {is_dangerous_result}")
is_dangerous_result_safe = confusables.is_dangerous(text_safe, preferred_aliases=['latin'])
print(f"Is '{text_safe}' dangerous? {is_dangerous_result_safe}")
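# Check for individual confusable characters; the next string contains a non-Latin look-alike for the letter "o"
# is_confusable() returns False when nothing is found, otherwise a list describing each confusable character
# greedy=True collects every confusable character instead of stopping at the first one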
text_confusable = "microsоft"
is_confusable_result = confusables.is_confusable(text_confusable, greedy=True, preferred_aliases=['latin'])
print(f"Is '{text_confusable}' confusable? {is_confusable_result}")
text_not_confusable = "microsoft"
is_confusable_result_safe = confusables.is_confusable(text_not_confusable, greedy=True, preferred_aliases=['latin'])
print(f"Is '{text_not_confusable}' confusable? {is_confusable_result_safe}")
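Because `is_confusable()` hands back either `False` or a list, it can help to turn the result into readable lines before logging it. The sketch below assumes each list entry is a dict with `character`, `alias` and `homoglyphs` keys, and that each homoglyph is a dict with `c` (the character) and `n` (its Unicode name). That shape is an assumption to verify against your installed version, and `summarize_confusables` is a hypothetical helper.

```python
def summarize_confusables(result):
    """Turn an is_confusable()-style result into a list of readable lines.

    Assumes result is False or a list of dicts with 'character', 'alias'
    and 'homoglyphs' keys. Missing keys fall back to placeholders.
    """
    if not result:
        return []
    lines = []
    for hit in result:
        lookalikes = ", ".join(
            f"{h.get('c', '?')} ({h.get('n', 'unknown name')})"
            for h in hit.get("homoglyphs", [])
        )
        lines.append(
            f"{hit.get('character', '?')!r} [{hit.get('alias', '?')}] resembles: {lookalikes}"
        )
    return lines


# Hand-written sample in the assumed shape, not real library output.
sample = [
    {
        "character": "\u043e",  # CYRILLIC SMALL LETTER O
        "alias": "CYRILLIC",
        "homoglyphs": [{"c": "o", "n": "LATIN SMALL LETTER O"}],
    }
]

for line in summarize_confusables(sample):
    print(line)
```

In real use, the argument would be the value returned by `confusables.is_confusable(...)` in the quickstart above.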