unicodedata2
unicodedata2 is a backport of the `unicodedata` module from the Python standard library, built against the latest version of the Unicode Character Database (UCD). It exposes the same API as the standard module: functions for querying character properties (name, category, numeric value) and for normalizing Unicode strings. The current version is 17.0.1, and new major versions typically track new releases of the Unicode Standard.
Warnings
- breaking Major versions (e.g., 16.0.0, 17.0.0) correspond to new Unicode Standard releases. Code that relies on specific character properties, or on whether a given character exists, may behave differently or break after an upgrade because each release of the standard adds characters and can change or deprecate existing ones.
- breaking Support for End-of-Life (EOL) Python versions is periodically dropped. For example, version 14.0.0 dropped support for Python 2.7 and 3.5, and older versions removed support for Python 3.3 and 3.4.
- gotcha Installing `unicodedata2` does not replace the standard library's `unicodedata` module. You must explicitly `import unicodedata2` to use the updated Unicode character database. A plain `import unicodedata` still uses the built-in data tied to your Python interpreter's version, which is usually older.
- gotcha `unicodedata2` may fail to build on a newly released Python version until the library publishes an update. For instance, there were initial problems building `unicodedata2` with Python 3.11 that were later resolved in version 15.0.0.
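The import gotcha above can be handled with an optional-import pattern. The following is a minimal sketch, not the library's prescribed usage: the alias `ucd` and the helper `version_tuple` are hypothetical names chosen for illustration. It prefers the backport when installed, falls back to the standard module otherwise, and prints which Unicode data version each one carries via the `unidata_version` attribute.

```python
# Prefer the backport when it is installed; otherwise fall back to the
# standard library module, which exposes the same API.
try:
    import unicodedata2 as ucd  # "ucd" is an illustrative alias
except ImportError:
    import unicodedata as ucd

import unicodedata as stdlib_ucd


def version_tuple(version):
    """Turn a version string such as '15.0.0' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


print("Active module:", ucd.__name__)
print("Active Unicode data version:", ucd.unidata_version)
print("Stdlib Unicode data version:", stdlib_ucd.unidata_version)

# True when the active module carries newer data than the interpreter's own.
is_newer = version_tuple(ucd.unidata_version) > version_tuple(stdlib_ucd.unidata_version)
print("Active data is newer than stdlib:", is_newer)
```

Logging the data version at startup is one way to notice when a dependency upgrade has changed the Unicode release your code runs against.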
Install
pip install unicodedata2
Imports
- unicodedata2
import unicodedata2
Quickstart
import unicodedata2
# Get character name
char = 'é'
name = unicodedata2.name(char)
print(f"Character: '{char}', Name: {name}")
# Get character category
category = unicodedata2.category(char)
print(f"Category for '{char}': {category}")
# Normalize a Unicode string
s1 = 'café'
s2 = 'cafe\u0301'  # 'e' followed by a combining acute accent
print(f"String 1: '{s1}', String 2: '{s2}'")
print(f"Are they equal? {s1 == s2}")
normalized_s1 = unicodedata2.normalize('NFC', s1)
normalized_s2 = unicodedata2.normalize('NFC', s2)
print(f"Normalized S1 (NFC): '{normalized_s1}'")
print(f"Normalized S2 (NFC): '{normalized_s2}'")
print(f"Are they equal after NFC? {normalized_s1 == normalized_s2}")
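The quickstart covers names, categories, and normalization. Numeric values and combining classes can be queried in the same way. The sketch below assumes the optional-import fallback so that it also runs where `unicodedata2` is not installed; the helper `describe` and the alias `ucd` are hypothetical names introduced here, not part of the library.

```python
# Fall back to the standard module if the backport is unavailable;
# both expose the same functions.
try:
    import unicodedata2 as ucd
except ImportError:
    import unicodedata as ucd


def describe(char):
    """Collect a few properties of a single character into a dict."""
    return {
        # The second argument is returned instead of raising ValueError
        # when the character has no name or no numeric value.
        "name": ucd.name(char, "<unnamed>"),
        "category": ucd.category(char),
        "numeric": ucd.numeric(char, None),
        "combining": ucd.combining(char),
    }


for ch in ("5", "½", "\u0301"):
    print(repr(ch), describe(ch))
```

Passing a default to `name` and `numeric` avoids exception handling for unassigned or non-numeric characters, which matters most when the input may include code points that only exist in a newer Unicode version.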