Confusables

1.2.0 · active · verified Mon Apr 13

Confusables is a Python package for analyzing and matching words that look alike but are built from different Unicode characters. It uses the official Unicode confusable characters list to detect homoglyphs, which is useful for identifying malicious look-alike website names, normalizing text data, or catching attempts to bypass profanity filters. The library is currently at version 1.2.0 and is updated as needed, particularly when the Unicode character set changes.

Warnings

Install

pip install confusables

Imports

from confusables import is_confusable, confusable_regex, normalize

Quickstart

This quickstart demonstrates the three core functions: is_confusable checks whether two strings are confusable, confusable_regex generates a regular expression that matches confusable variations of a string, and normalize returns a list of possible "normal forms" of a string, prioritizing ASCII characters.

from confusables import is_confusable, confusable_regex, normalize

# Check if two strings are confusable
print(f"'rover' vs 'ƦỏV3ℛ': {is_confusable('rover', 'ƦỏV3ℛ')}")

# Generate a regex for confusable characters
regex_pattern = confusable_regex('admin', include_character_padding=True)
print(f"Regex for 'admin': {regex_pattern}")

# Normalize a string to its confusable ASCII counterparts
normalized_forms = normalize('micrоsoft', prioritize_alpha=True)
print(f"Normalized forms of 'micrоsoft': {normalized_forms}")
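
For intuition about what a homoglyph check and a confusable regex involve, the following is a minimal standard-library sketch. It is not the confusables package's implementation: TOY_CONFUSABLES, toy_skeleton, toy_is_confusable, and toy_confusable_regex are hypothetical names, and the mapping is a hand-picked handful of look-alikes rather than the full Unicode confusables list.

```python
import re
import unicodedata

# Hypothetical, hand-picked look-alike mappings for illustration only.
# The official Unicode confusables data contains thousands of entries.
TOY_CONFUSABLES = {
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def toy_skeleton(text: str) -> str:
    """Reduce text to a comparison key by replacing known look-alikes."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(TOY_CONFUSABLES.get(ch, ch) for ch in decomposed)
    return mapped.casefold()


def toy_is_confusable(left: str, right: str) -> bool:
    """Treat two strings as confusable when their toy skeletons match."""
    return toy_skeleton(left) == toy_skeleton(right)


def toy_confusable_regex(word: str) -> str:
    """Build a pattern with one character class per letter of `word`."""
    parts = []
    for ch in word:
        # Start with both cases of the letter, then add its known look-alikes.
        alternatives = {ch.lower(), ch.upper()}
        alternatives.update(
            src for src, dst in TOY_CONFUSABLES.items() if dst == ch.lower()
        )
        escaped = "".join(re.escape(alt) for alt in sorted(alternatives))
        parts.append(f"[{escaped}]")
    return "".join(parts)


if __name__ == "__main__":
    spoofed = "micr\u043esoft"  # Cyrillic 'о' in place of Latin 'o'
    print(toy_is_confusable("microsoft", spoofed))
    pattern = toy_confusable_regex("admin")
    print(bool(re.fullmatch(pattern, "\u0430dm\u0456n")))
```

The sketch compares strings by mapping each character to a canonical stand-in, and builds the regex by inverting the same mapping into per-letter character classes. A real check should rely on the package's own functions and the full Unicode data rather than a table like this one.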
