unicategories

0.1.2 · active · verified Thu Apr 16

unicategories is a Python library that provides a Unicode category database, generated and cached on setup. It exposes a dictionary, `categories`, of `RangeGroup` instances covering all Unicode category character ranges detected on your system. Each category, such as 'Letter, uppercase' (Lu) or 'Number, decimal digit' (Nd), is stored as code point ranges rather than as individual characters, which keeps the database compact in memory. The current version is 0.1.2, released on April 2, 2023.
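To illustrate why storing ranges is compact, here is a minimal sketch that uses only the standard library. It is not the unicategories implementation: the helpers `build_ranges` and `in_ranges` are hypothetical names, and the half-open `(start, stop)` tuple layout is an assumption made for this example.

```python
import sys
import unicodedata
from bisect import bisect_right


def build_ranges(category, limit=sys.maxunicode + 1):
    """Collect half-open (start, stop) code point ranges for one category.

    Hypothetical helper for illustration; scans code points below `limit`.
    """
    ranges = []
    start = None
    for code in range(limit):
        if unicodedata.category(chr(code)) == category:
            if start is None:
                start = code  # a new run of matching code points begins
        elif start is not None:
            ranges.append((start, code))  # the run ended just before `code`
            start = None
    if start is not None:
        ranges.append((start, limit))
    return ranges


def in_ranges(ranges, char):
    """Binary-search the sorted ranges for the code point of `char`."""
    code = ord(char)
    # Index of the last range whose start is <= code, or -1 if there is none.
    idx = bisect_right(ranges, (code, float("inf"))) - 1
    return idx >= 0 and ranges[idx][0] <= code < ranges[idx][1]


# Restrict the scan to ASCII so the result does not depend on the
# Unicode version bundled with the interpreter.
ascii_upper = build_ranges("Lu", limit=0x80)
print(ascii_upper)  # [(65, 91)]
print(in_ranges(ascii_upper, "Q"), in_ranges(ascii_upper, "q"))  # True False
```

In this sketch the 26 ASCII uppercase letters collapse into a single tuple, and membership is checked with a binary search instead of a lookup in a full character set.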

Common errors

Warnings

Install

Imports

Quickstart

This quickstart shows how to import the `categories` dictionary, look up a `RangeGroup` by category code (e.g., 'Lu' for uppercase letters), iterate over its characters or code points with `characters()` and `codes()`, and test whether a character belongs to a category with `has()`.

from itertools import islice

from unicategories import categories

# Get an iterator over all Unicode uppercase characters
upper_characters_iterator = categories['Lu'].characters()
# islice takes the first 10 items without materializing the whole category
first_upper = ''.join(islice(upper_characters_iterator, 10))
print(f"First 10 uppercase characters: {first_upper}")

# Check whether a character belongs to a specific category
is_digit = categories['Nd'].has('7')
print(f"Is '7' a decimal digit? {is_digit}")

is_lowercase_a = categories['Ll'].has('a')
print(f"Is 'a' a lowercase letter? {is_lowercase_a}")

# Get an iterator over all Unicode code points in a category
code_points_iterator = categories['Zs'].codes()  # Zs: Space Separator
first_spaces = list(islice(code_points_iterator, 5))
print(f"First 5 space separator code points: {first_spaces}")
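The two-letter keys used in the quickstart ('Lu', 'Nd', 'Ll', 'Zs') are standard Unicode general category codes. As a sanity check that does not depend on unicategories itself, the standard library's `unicodedata.category` reports the same codes for a single character. The sketch below assumes nothing about the library's internals; `SAMPLES` and `check_samples` are illustrative names.

```python
import unicodedata

# Sample characters paired with the general category code each should carry.
SAMPLES = {"A": "Lu", "a": "Ll", "7": "Nd", " ": "Zs"}


def check_samples(samples):
    """Return the characters whose stdlib category differs from the expected code."""
    return [ch for ch, code in samples.items() if unicodedata.category(ch) != code]


mismatches = check_samples(SAMPLES)
print(f"Mismatched samples: {mismatches}")
```

An empty list means every sample character carries the expected category code, so the same characters would be reasonable inputs for `has()` on the matching `RangeGroup`.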
