pyunormalize Unicode Normalization Library
pyunormalize is a pure-Python library for Unicode normalization (NFC, NFD, NFKC, NFKD) that operates independently of Python's built-in Unicode database. It uses its own dedicated data, ensuring strict conformance to the latest Unicode Standard (currently v17.0.0). New major versions are typically released to align with updates to the Unicode Standard.
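Because the library carries its own data, its Unicode version can differ from the one compiled into the interpreter. A minimal sketch for checking both follows; the helper name `versions_differ` is hypothetical, and the `ImportError` branch is only there so the snippet still runs where pyunormalize is not installed.

```python
import unicodedata

try:
    from pyunormalize import UCD_VERSION as lib_version
except ImportError:
    # Assumption: pyunormalize may be absent; None marks "unknown".
    lib_version = None

# Unicode version of the database bundled with this Python interpreter
stdlib_version = unicodedata.unidata_version


def versions_differ(lib, stdlib):
    """Return True when both versions are known and they are not equal."""
    return lib is not None and lib != stdlib


print(f"pyunormalize data: {lib_version}")
print(f"unicodedata data:  {stdlib_version}")
print(f"Versions differ:   {versions_differ(lib_version, stdlib_version)}")
```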
Warnings
- breaking Version 17.0.0 dropped official support for Python 3.6 and 3.7. Users on those Python versions should upgrade to Python 3.8 or newer, or pin an older pyunormalize release (< 17.0.0).
- gotcha pyunormalize ships its own Unicode Character Database (UCD), so it does not depend on the UCD version bundled with your Python interpreter's `unicodedata` module. This is its core feature, but users migrating from `unicodedata` should expect that results might differ for some inputs when the interpreter's `unicodedata` targets an older or newer Unicode version than pyunormalize does.
- gotcha pyunormalize implements Unicode normalization correctly, but normalization alone does not address the broader security implications of Unicode equivalence. Strings built from different code points can become identical after normalization (especially under the compatibility forms NFKC and NFKD), so a check performed on the raw input may not hold for the normalized value. This can be exploited in security-sensitive code such as path handling, input validation, and string comparisons during authentication.
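The last warning can be illustrated with a small sketch in which a naive comparison misses a compatibility variant of a reserved name. The `RESERVED` set and the `is_reserved` helper are hypothetical, and the fallback built on the standard library's `unicodedata` is only there so the snippet runs without pyunormalize installed.

```python
try:
    from pyunormalize import NFKC
except ImportError:
    # Assumption: stand-in used only when pyunormalize is not installed.
    import unicodedata

    def NFKC(s):
        return unicodedata.normalize("NFKC", s)


RESERVED = {"admin", "root"}  # hypothetical list of reserved names


def is_reserved(name):
    """Check the NFKC-normalized name so compatibility variants are caught."""
    return NFKC(name) in RESERVED


# U+FF41 FULLWIDTH LATIN SMALL LETTER A followed by "dmin"
spoofed = "\uff41dmin"

print(spoofed in RESERVED)   # naive check on the raw input
print(is_reserved(spoofed))  # check on the normalized input
```

In this sketch the raw comparison misses the name while the normalized comparison catches it. Normalizing once at the input boundary and using only the normalized value afterwards is one way to keep the two checks from disagreeing.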
Install
pip install pyunormalize
Imports
- NFC
from pyunormalize import NFC
- NFD
from pyunormalize import NFD
- NFKC
from pyunormalize import NFKC
- NFKD
from pyunormalize import NFKD
- UCD_VERSION
from pyunormalize import UCD_VERSION
- normalize (general function)
from pyunormalize import normalize
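When the normalization form is only known at run time, the general `normalize` function can stand in for the four form-specific callables. The sketch below assumes it takes the form name first and the string second, mirroring `unicodedata.normalize`; the fallback import exists only so the snippet runs without pyunormalize installed.

```python
try:
    from pyunormalize import normalize
except ImportError:
    # Assumption: the standard-library function takes the same (form, string) arguments.
    from unicodedata import normalize

# U+FB01 LATIN SMALL LIGATURE FI followed by "n"
sample = "\ufb01n"

# Apply each form by name and collect the results
results = {form: normalize(form, sample) for form in ("NFC", "NFD", "NFKC", "NFKD")}

for form, value in results.items():
    print(f"{form}: {value!r} ({len(value)} code points)")
```

The canonical forms (NFC, NFD) leave the ligature untouched, while the compatibility forms (NFKC, NFKD) replace it with the two letters "f" and "i".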
Quickstart
from pyunormalize import NFC, NFD, NFKC, NFKD, UCD_VERSION
# Example string with accented characters
text = "élève"
# Normalize to different forms
nfc_text = NFC(text)
nfd_text = NFD(text)
nfkc_text = NFKC(text)
nfkd_text = NFKD(text)
print(f"Original: {text}")
print(f"NFC: {nfc_text}")
print(f"NFD: {nfd_text}")
print(f"NFKC: {nfkc_text}")
print(f"NFKD: {nfkd_text}")
print(f"Unicode database version: {UCD_VERSION}")
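The normalized strings printed above usually look identical on screen, so listing their code points can make the difference visible. This is a minimal sketch: the helper name `codepoints` is hypothetical, and the fallback built on the standard library's `unicodedata` is only there so the snippet runs where pyunormalize is not installed.

```python
try:
    from pyunormalize import NFC, NFD
except ImportError:
    # Assumption: stand-ins used only when pyunormalize is not installed.
    import unicodedata

    def NFC(s):
        return unicodedata.normalize("NFC", s)

    def NFD(s):
        return unicodedata.normalize("NFD", s)


def codepoints(s):
    """Return the string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# "élève" written with precomposed accented letters
text = "\u00e9l\u00e8ve"

print(len(NFC(text)), codepoints(NFC(text)))
print(len(NFD(text)), codepoints(NFD(text)))
```

NFC keeps the five precomposed code points, while NFD yields seven because each accented letter is split into a base letter followed by a combining mark.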