unicategories
unicategories is a Python library that provides a Unicode category database, generated and cached on setup. It exposes a dictionary of `RangeGroup` instances, containing all Unicode category character ranges detected on your system. This module offers an efficient way to work with Unicode character classifications, such as 'Letter, uppercase' (Lu) or 'Number, decimal digit' (Nd), by storing ranges rather than individual characters for memory efficiency. The current version is 0.1.2, released on April 2, 2023.
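To illustrate why storing ranges is cheaper than storing individual characters, here is a minimal sketch built only on the standard library. It is not the library's implementation: `build_ranges` and `in_ranges` are hypothetical helper names, and the ranges are derived from `unicodedata` at call time rather than from a cached database.

```python
import sys
import unicodedata
from bisect import bisect_right


def build_ranges(category):
    """Collapse all code points of a category into (start, end) pairs, end exclusive."""
    ranges = []
    for code in range(sys.maxunicode + 1):
        if unicodedata.category(chr(code)) != category:
            continue
        if ranges and ranges[-1][1] == code:
            # Code point is adjacent to the previous range: extend it.
            ranges[-1] = (ranges[-1][0], code + 1)
        else:
            ranges.append((code, code + 1))
    return ranges


def in_ranges(ranges, character):
    """Binary-search the sorted ranges for the code point of `character`."""
    code = ord(character)
    # Find the last range whose start is <= code.
    index = bisect_right(ranges, (code, sys.maxunicode + 2)) - 1
    return index >= 0 and ranges[index][0] <= code < ranges[index][1]


decimal_digits = build_ranges("Nd")
covered = sum(end - start for start, end in decimal_digits)
print(f"{len(decimal_digits)} ranges cover {covered} decimal digit code points")
print(in_ranges(decimal_digits, "7"), in_ranges(decimal_digits, "a"))
```

Because code points in the same category tend to be contiguous, the number of stored ranges is much smaller than the number of characters they cover, and membership checks stay logarithmic in the number of ranges.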
Common errors
-
UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>
cause Python is decoding bytes with a codec that does not match the encoding they were written in, for example UTF-8 data read with Windows-1252 (reported as 'charmap') or another platform default encoding. This typically happens when a file or external data source is opened without an explicit encoding.
fix Always specify the correct encoding, preferably UTF-8, when opening files or decoding byte strings. For file operations: `with open('filename.txt', 'r', encoding='utf-8') as f: ...`. For byte strings: `my_bytes.decode('utf-8')`.
-
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position X-Y: truncated \uXXXX escape
cause This error often occurs on Windows when a backslash (`\`) in a regular string literal, typically a file path, is followed by `u` or `U`. Python treats `\u` or `\U` as the start of a Unicode escape sequence and fails when the required hexadecimal digits (four for `\u`, eight for `\U`) do not follow.
fix Use raw strings by prefixing with `r` (e.g., `r'C:\Users\...'`) or double the backslashes (`'C:\\Users\\...'`). Alternatively, use `pathlib.Path` with forward slashes for platform-agnostic path handling: `from pathlib import Path; path = Path('C:/Users/...')`.
-
UnicodeEncodeError: 'ascii' codec can't encode character '\uXXXX' in position Y: ordinal not in range(128)
cause Attempting to convert a Unicode string containing non-ASCII characters to an ASCII byte string without specifying an appropriate encoding, or when the target encoding (like 'ascii') cannot represent the characters present.
fix Explicitly encode the Unicode string into a suitable byte encoding, such as UTF-8, using the `.encode()` method: `my_unicode_string.encode('utf-8')`.
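The three errors above can be reproduced and avoided with the standard library alone. The following is a sketch under stated assumptions: the sample text, the temporary file name `sample.txt`, and the path `C:\Users\example` are made up for illustration.

```python
import tempfile
from pathlib import Path

text = "\u00c1ngel caf\u00e9"  # "Ángel café", contains non-ASCII characters
utf8_bytes = text.encode("utf-8")

# 1. Decoding UTF-8 bytes with the wrong codec: cp1252 has no mapping
#    for byte 0x81, which appears in the UTF-8 encoding of "Á".
try:
    utf8_bytes.decode("cp1252")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
decoded = utf8_bytes.decode("utf-8")  # the matching codec works

# The same rule applies to files: pass the encoding explicitly.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"  # hypothetical file name
    sample.write_text(text, encoding="utf-8")
    with open(sample, "r", encoding="utf-8") as f:
        from_file = f.read()

# 2. Windows-style paths: a raw string and doubled backslashes
#    produce the same value, and neither triggers a \U escape.
raw_path = r"C:\Users\example"  # hypothetical path
escaped_path = "C:\\Users\\example"

# 3. Encoding non-ASCII text as ASCII fails; UTF-8 can represent it.
try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc
encoded = text.encode("utf-8")

print(decode_error)
print(encode_error)
print(raw_path == escaped_path)
```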
Warnings
- gotcha The library primarily uses iterators (e.g., `characters()`, `codes()`) for memory efficiency. If you need a complete list, remember to explicitly convert the iterator to a list or another collection, which will consume more memory.
- gotcha While `unicategories` supports Python 2.7, Python 2 is end-of-life. It is strongly recommended to use this library with Python 3.5+ to ensure security, maintainability, and compatibility with the latest Unicode standards and Python features.
- gotcha This library provides access to Unicode *categories*. For other Unicode character properties (like name, numeric value, bidirectional class), use Python's built-in `unicodedata` module.
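As a hedged illustration of two of the points above, the sketch below uses only the standard library. `lazy_characters` is a hypothetical stand-in generator (not part of `unicategories`) used to show how `itertools.islice` takes a few items from an iterator without building the full list; the property lookups use the built-in `unicodedata` module.

```python
import sys
import unicodedata
from itertools import islice


def lazy_characters(category):
    """Stand-in generator: lazily yield characters of one general category."""
    return (
        chr(code)
        for code in range(sys.maxunicode + 1)
        if unicodedata.category(chr(code)) == category
    )


# islice stops after three matches instead of materializing the category.
first_three = list(islice(lazy_characters("Lu"), 3))

# Properties other than the general category come from unicodedata.
char = "\u0663"  # ARABIC-INDIC DIGIT THREE
properties = {
    "category": unicodedata.category(char),
    "name": unicodedata.name(char),
    "numeric": unicodedata.numeric(char),
    "bidirectional": unicodedata.bidirectional(char),
}

print(first_three)
print(properties)
```

Calling `list()` on the full generator would also work, but it holds every matching character in memory at once, which is the cost the first warning describes.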
Install
-
pip install unicategories
Imports
- categories
from unicategories import categories
Quickstart
from itertools import islice

from unicategories import categories

# Get an iterator over all Unicode uppercase characters
upper_characters_iterator = categories['Lu'].characters()
# islice takes the first 10 items without building a list of the whole category
print(f"First 10 uppercase characters: {''.join(islice(upper_characters_iterator, 10))}")

# Check whether a character belongs to a specific category
is_digit = categories['Nd'].has('7')
print(f"Is '7' a decimal digit? {is_digit}")
is_lowercase_a = categories['Ll'].has('a')
print(f"Is 'a' a lowercase letter? {is_lowercase_a}")

# Get an iterator over all Unicode code points in a category
code_points_iterator = categories['Zs'].codes()  # Zs: Space Separator
print(f"First 5 space separator code points: {list(islice(code_points_iterator, 5))}")