pyuca: Unicode Collation Algorithm
pyuca is a pure-Python implementation of the Unicode Collation Algorithm (UCA). It sorts non-English strings correctly by accounting for linguistic rules such as accents, contractions, and expansions. It implements multi-level comparison and passes the UCA conformance tests for several Unicode versions; which version applies depends on the Python environment's `unicodedata` library. The current version is 1.2, released in September 2017. The library still works and is still in use (e.g., in Fedora packages), but it is not actively maintained, and some consider it slightly obsolete.
Common errors
-
My non-English strings are not sorting alphabetically correctly with Python's default `sorted()` function.
cause Python's default `sorted()` compares strings by Unicode code point (binary, lexicographical order), which does not account for the linguistic rules of many languages (e.g., accents, ligatures, contractions, expansions).
fix Use `pyuca.Collator` to generate sort keys that follow the Unicode Collation Algorithm:
```python
from pyuca import Collator

collator = Collator()
words = ["résumé", "resume", "résiste"]
sorted_words = sorted(words, key=collator.sort_key)
# sorted_words will be: ['résiste', 'resume', 'résumé']
# ("i" sorts before "u" at the primary level; the accent only breaks the tie
# between "resume" and "résumé")
```
-
My application is slow when sorting many Unicode strings, even with `pyuca`.
cause `pyuca` is a pure-Python library, and the Unicode Collation Algorithm itself is computationally intensive. For very large collections of strings, the overhead can be noticeable.
fix Initialize the `Collator` object once and reuse it across sorting operations. If performance is still a problem, profile your code and consider a collation library backed by a C or C++ implementation (e.g., `PyICU`), if one is available for your platform.
-
The sorting order for specific characters in my language isn't quite right, even with `pyuca`.
cause `pyuca` primarily implements the Default Unicode Collation Element Table (DUCET). Some languages have tailored collation rules that deviate from the DUCET, and `pyuca` does not readily support such custom rules.
fix Check whether the ordering you expect is standard UCA behavior or a locale-specific tailoring. `pyuca` is not designed for easy injection of custom rules, so for tailored collation other internationalization libraries (e.g., `PyICU`, the Python bindings for ICU, though not a drop-in replacement for `pyuca`) might offer more control over collation rules and tailoring options.
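The performance advice above (build the collator once, avoid repeated key computation) can be sketched as a small caching wrapper. This is a minimal illustration, not part of `pyuca`: `make_cached_key` and `slow_key` are hypothetical names, and `slow_key` is a stand-in so the sketch runs without `pyuca` installed. With `pyuca`, the assumption is that you would pass a single shared `Collator().sort_key` instead.

```python
from functools import lru_cache


def make_cached_key(key_func, maxsize=None):
    """Wrap an expensive sort-key function so each distinct string is keyed once.

    Hypothetical helper: key_func could be Collator().sort_key from pyuca,
    created once and shared by every sort in the application.
    """
    @lru_cache(maxsize=maxsize)
    def cached(s):
        return key_func(s)

    return cached


# Stand-in for an expensive key function; it records every real computation.
calls = []


def slow_key(s):
    calls.append(s)
    return s.casefold()


key = make_cached_key(slow_key)
names = ["Bob", "alice", "Bob", "alice", "Carol"]

first = sorted(names, key=key)             # computes keys for 3 distinct strings
second = sorted(reversed(names), key=key)  # served entirely from the cache

print(first)
print(len(calls))
```

`sorted()` already calls the key function only once per element, so the cache pays off mainly when the same strings appear repeatedly or are sorted more than once.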
Warnings
- deprecated The `pyuca` library has not seen active development since its last release in September 2017. It remains functional, but new features and bug fixes are not expected, and it may be considered 'slightly obsolete' compared with more actively maintained libraries. No direct pure-Python replacement is identified here.
- gotcha As a pure-Python implementation, `pyuca` can introduce performance overhead for very large datasets or performance-critical applications when compared to libraries with C-extensions or more optimized collation engines.
- gotcha While `pyuca` provides general Unicode collation, implementing highly specific language-tailoring rules (e.g., custom character ordering for a particular dialect) is not directly supported or straightforward. Customizing `allkeys.txt` is complex and error-prone.
- gotcha The specific Unicode Collation Algorithm (UCA) version supported by `pyuca` dynamically depends on the `unicodedata` library version available in your Python environment. Older Python versions might not support the latest Unicode standards.
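To see which Unicode data your interpreter carries, a short check like the following may help. It is a sketch that uses only the standard library; `unicode_environment` is a hypothetical helper name. It reports the interpreter's Unicode database version, which is not necessarily the same as the collation table `pyuca` ends up using.

```python
import sys
import unicodedata


def unicode_environment():
    """Return (python_version, unicode_database_version) for this interpreter."""
    python_version = ".".join(str(part) for part in sys.version_info[:3])
    return python_version, unicodedata.unidata_version


python_version, unidata_version = unicode_environment()
print(f"Python {python_version} ships Unicode database {unidata_version}")
```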
Install
-
pip install pyuca
Imports
- Collator
from pyuca.collator import Collator_X_Y_Z
from pyuca import Collator
Quickstart
from pyuca import Collator


def sort_strings_pyuca(strings):
    # Initialize the Collator. It automatically selects the appropriate
    # Unicode version based on your Python environment.
    c = Collator()
    # Use the collator's sort_key method with Python's built-in sorted()
    sorted_list = sorted(strings, key=c.sort_key)
    return sorted_list


# Example usage:
words = ["cafe", "caff", "café", "cozy", "česky"]
print(f"Original list: {words}")
sorted_words = sort_strings_pyuca(words)
print(f"Sorted list (pyuca): {sorted_words}")

# Demonstrating behavior with special characters
assert sort_strings_pyuca(["cafe", "caff", "café"]) == ["cafe", "café", "caff"]
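For contrast with the final assertion, here is a sketch of what the built-in `sorted()` does with the same three words when no collation key is supplied. It uses only the standard library, and the accented word is written with an escape so the result does not depend on how the file is normalized.

```python
# "caf\u00e9" is "café" with a precomposed é (U+00E9).
words = ["cafe", "caff", "caf\u00e9"]

# Without a key, sorted() compares code points: "f" (U+0066) < "é" (U+00E9),
# so "caff" lands before "café", unlike the collated order.
code_point_order = sorted(words)
print(code_point_order)  # ['cafe', 'caff', 'café']
```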