{"id":8751,"library":"unicategories","title":"unicategories","description":"unicategories is a Python library that provides a Unicode category database, generated and cached on setup. It exposes a dictionary of `RangeGroup` instances, containing all Unicode category character ranges detected on your system. Each `RangeGroup` stores code point ranges rather than individual characters, which keeps memory use low when working with Unicode general categories such as 'Letter, uppercase' (Lu) or 'Number, decimal digit' (Nd). The current version is 0.1.2, released on April 2, 2023.","status":"active","version":"0.1.2","language":"en","source_language":"en","source_url":"https://github.com/ergoithz/unicategories","tags":["unicode","character","category","database","utility"],"install":[{"cmd":"pip install unicategories","lang":"bash","label":"Install stable version"}],"dependencies":[],"imports":[{"symbol":"categories","correct":"from unicategories import categories"}],"quickstart":{"code":"from itertools import islice\n\nfrom unicategories import categories\n\n# Get an iterator over all Unicode uppercase characters\nupper_characters_iterator = categories['Lu'].characters()\n# islice takes the first 10 items without materializing the whole category\nprint(f\"First 10 uppercase characters: {''.join(islice(upper_characters_iterator, 10))}\")\n\n# Check if a character belongs to a specific category\nis_digit = categories['Nd'].has('7')\nprint(f\"Is '7' a decimal digit? {is_digit}\")\n\nis_lowercase_a = categories['Ll'].has('a')\nprint(f\"Is 'a' a lowercase letter? 
{is_lowercase_a}\")\n\n# Get an iterator for all Unicode code points in a category\ncode_points_iterator = categories['Zs'].codes() # Zs: Space Separator\nprint(f\"First 5 space separator code points: {list(code_points_iterator)[:5]}\")","lang":"python","description":"This quickstart demonstrates how to import the `categories` dictionary, access `RangeGroup` instances by category code (e.g., 'Lu' for uppercase letters), retrieve characters or code points using `characters()` and `codes()` iterators, and check for character inclusion with `has()`."},"warnings":[{"fix":"Use `list(categories['Lu'].characters())` if a full list is required, but be mindful of memory usage for very large categories.","message":"The library primarily uses iterators (e.g., `characters()`, `codes()`) for memory efficiency. If you need a complete list, remember to explicitly convert the iterator to a list or another collection, which will consume more memory.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure your project is running on Python 3.5 or newer. Python 3 handles Unicode natively, reducing potential encoding issues.","message":"While `unicategories` supports Python 2.7, Python 2 is end-of-life. It is strongly recommended to use this library with Python 3.5+ to ensure security, maintainability, and compatibility with the latest Unicode standards and Python features.","severity":"gotcha","affected_versions":"All versions supporting Python 2.7"},{"fix":"Combine `unicategories` for category-based filtering/lookup with `unicodedata` for individual character properties. Example: `import unicodedata; char_name = unicodedata.name('A')`.","message":"This library provides access to Unicode *categories*. 
For other Unicode character properties (like name, numeric value, bidirectional class), use Python's built-in `unicodedata` module.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Always specify the correct encoding, preferably UTF-8, when opening files or decoding byte strings. For file operations: `with open('filename.txt', 'r', encoding='utf-8') as f: ...`. For byte strings: `my_bytes.decode('utf-8')`.","cause":"Python is decoding bytes with a codec that does not match the encoding the text was written in, for example UTF-8 data read with a Windows code page such as cp1252 (reported as 'charmap'). This commonly happens on Windows when `open()` is called without an `encoding` argument and falls back to the locale's default encoding.","error":"UnicodeDecodeError: 'charmap' codec can't decode byte X in position Y: character maps to <undefined>"},{"fix":"Use raw strings by prefixing with `r` (e.g., `r'C:\\Users\\...'`) or double the backslashes (`'C:\\\\Users\\\\...'`). 
Alternatively, use `pathlib.Path` for platform-agnostic path handling: `from pathlib import Path; path = Path('C:/Users/...')`.","cause":"This error often occurs on Windows when a backslash (`\\`) in a string literal, typically a file path, is followed by `u` or `U`. Python reads `\\u` or `\\U` as the start of a Unicode escape sequence and raises the error when the required hexadecimal digits do not follow.","error":"SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position X-Y: truncated \\uXXXX escape"},{"fix":"Explicitly encode the Unicode string into a suitable byte encoding, such as UTF-8, using the `.encode()` method: `my_unicode_string.encode('utf-8')`.","cause":"Encoding a string that contains non-ASCII characters with a codec that cannot represent them, such as 'ascii'. This happens either when that codec is requested explicitly or when the environment falls back to ASCII because no suitable encoding was specified.","error":"UnicodeEncodeError: 'ascii' codec can't encode character '\\uXXXX' in position Y: ordinal not in range(128)"}]}