pyuca: Unicode Collation Algorithm

1.2 · maintenance · verified Thu Apr 16

pyuca is a pure-Python implementation of the Unicode Collation Algorithm (UCA). It sorts non-English strings correctly by handling accents, contractions, and expansions through multi-level comparison. It passes the UCA conformance tests for several Unicode versions; which version applies depends on the `unicodedata` module that ships with your Python interpreter. The current version is 1.2, released in September 2017. The library still works and is still in use (for example, in Fedora packages), but it is not actively maintained, so its collation data may lag behind recent Unicode releases.
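
To make "multi-level comparison" concrete, here is a toy sketch built only on the standard library. It is not the UCA and not how pyuca works internally. The helper name `toy_multilevel_key` and its three levels are assumptions chosen for illustration: base letters are compared first, accents break ties among equal base letters, and case breaks any remaining ties.

```python
import unicodedata


def toy_multilevel_key(s):
    """Toy three-level sort key. Illustrative only, NOT the real UCA."""
    decomposed = unicodedata.normalize("NFD", s)
    base = [ch for ch in decomposed if not unicodedata.combining(ch)]
    # Level 1 (primary): base letters, ignoring accents and case.
    primary = tuple(ch.casefold() for ch in base)
    # Level 2 (secondary): the combining marks (accents), if any.
    secondary = tuple(ch for ch in decomposed if unicodedata.combining(ch))
    # Level 3 (tertiary): lowercase sorts before uppercase.
    tertiary = tuple(ch.isupper() for ch in base)
    return (primary, secondary, tertiary)


words = ["caff", "caf\u00e9", "Cafe", "cafe"]
toy_sorted = sorted(words, key=toy_multilevel_key)
print(toy_sorted)  # ['cafe', 'Cafe', 'café', 'caff']
```

A later level is consulted only when all earlier levels compare equal, which is why "café" stays ahead of "caff" even though it carries an accent.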

Quickstart

This quickstart creates a `Collator` and passes its `sort_key` method as the `key` argument to Python's built-in `sorted()` function, which sorts Unicode strings in linguistically correct order. The `Collator` automatically adapts to the Unicode version supported by your Python installation.

from pyuca import Collator

def sort_strings_pyuca(strings):
    # Initialize the Collator. It automatically selects the appropriate
    # Unicode version based on your Python environment.
    c = Collator()
    
    # Use the collator's sort_key method with Python's built-in sorted()
    sorted_list = sorted(strings, key=c.sort_key)
    return sorted_list

# Example usage:
words = ["cafe", "caff", "café", "cozy", "česky"]
print(f"Original list: {words}")
sorted_words = sort_strings_pyuca(words)
print(f"Sorted list (pyuca): {sorted_words}")

# "é" differs from "e" only by its accent, so "café" sorts right after "cafe"
# and before "caff", whose fourth letter "f" comes after "e".
assert sort_strings_pyuca(["cafe", "caff", "café"]) == ["cafe", "café", "caff"]
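
For contrast, the sketch below shows what the default sort does with the same words. Plain `sorted()` compares code points, so "é" (U+00E9) lands after every ASCII letter. The `try`/`except` guard is an addition for illustration, so the snippet still runs when pyuca is not installed.

```python
# Default sort: compares code points, so "é" (U+00E9) sorts after "f" (U+0066).
words = ["cafe", "caff", "caf\u00e9"]
codepoint_order = sorted(words)
print(codepoint_order)  # ['cafe', 'caff', 'café']

try:
    from pyuca import Collator
except ImportError:
    Collator = None  # pyuca is not installed; skip the comparison

if Collator is not None:
    # UCA order: the accent matters only after the base letters are compared.
    uca_order = sorted(words, key=Collator().sort_key)
    print(uca_order)
```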
