Grapheme Unicode Helpers
The `grapheme` library (current version 0.10.0) provides helpers for Unicode grapheme-aware string handling in Python. It enables accurate counting, slicing, and manipulation of strings based on user-perceived characters (graphemes) rather than Unicode code points. The library is actively maintained, supporting recent Unicode standards, and typically releases new versions a few times a year.
Warnings
- breaking: Support for Python 3.6 was dropped in version `0.7.0`. Users on Python 3.6 should pin their `grapheme` dependency to `<0.7.0`.
- breaking: Versions `0.9.0` and later, including the current `0.10.0`, require Python >=3.10. On an older Python (e.g., 3.8 or 3.9), either upgrade the interpreter or pin an older `grapheme` release.
- gotcha: Grapheme cluster boundaries can only be found by scanning the string, so the library's functions run in linear time (`O(n)`) relative to string length. For performance-critical code that handles very long strings, weigh correctness against speed.
- gotcha: Negative indexing is not currently supported by `grapheme.slice()`; a call such as `grapheme.slice(text, start=-1)` raises `NotImplementedError`.
- gotcha: Python's `in` operator checks substrings by Unicode code points. `grapheme.contains()` performs a grapheme-aware substring check, which can give a different result for multi-code-point graphemes (e.g., emoji or combining characters).
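The last two gotchas can be illustrated with a short sketch. It assumes that `grapheme.graphemes()` yields one string per grapheme cluster and that `grapheme.contains()` takes the string to search first and the substring second. The helper `last_graphemes` is hypothetical and not part of the library. The import is guarded only so that the standard-library part still runs where the package is not installed.

```python
try:
    import grapheme  # third-party: pip install grapheme
except ImportError:  # guard so the stdlib-only part below still runs
    grapheme = None

# Two flag emoji built from regional indicator pairs: [E S] [E E].
haystack = "\U0001F1EA\U0001F1F8\U0001F1EA\U0001F1EA"
# A different flag, [S E], whose code points straddle the two flags above.
needle = "\U0001F1F8\U0001F1EA"

# The built-in operator compares code points, so it finds a match that
# starts in the middle of the first flag.
codepoint_match = needle in haystack
print(codepoint_match)  # True


def last_graphemes(text, count):
    """Hypothetical helper: return the last `count` graphemes of `text`.

    Works around the missing negative-index support in grapheme.slice()
    by materialising the clusters into a list first (O(n) time and memory).
    """
    clusters = list(grapheme.graphemes(text))
    return "".join(clusters[-count:]) if count > 0 else ""


if grapheme is not None:
    # Grapheme-aware check: neither whole flag in the haystack is [S E].
    print(grapheme.contains(haystack, needle))
    # Last grapheme of a string, without calling slice(start=-1).
    print(last_graphemes("tamil \u0BA8\u0BBF", 1))
```

Building a full list of clusters costs extra memory, so for very long strings it may be preferable to compute `grapheme.length()` once and pass non-negative offsets to `grapheme.slice()`.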
Install
pip install grapheme
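Given the Python version requirements listed under Warnings, it can help to confirm what an environment actually provides after installing. The following is a minimal sketch using only the standard library; the function name `grapheme_environment` and the constant `MIN_PYTHON` are hypothetical, and the 3.10 threshold is taken from the warning above rather than read from the package itself.

```python
import sys
from importlib import metadata

# Minimum interpreter version stated in the warnings for grapheme >= 0.9.0.
MIN_PYTHON = (3, 10)


def grapheme_environment():
    """Return (python_ok, installed_version_or_None) for this interpreter."""
    python_ok = sys.version_info[:2] >= MIN_PYTHON
    try:
        installed = metadata.version("grapheme")
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        installed = None
    return python_ok, installed


python_ok, installed = grapheme_environment()
print(f"Python >= 3.10: {python_ok}; grapheme installed: {installed}")
```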
Imports
- length
import grapheme
grapheme.length('string')
- slice
import grapheme
grapheme.slice('string', start=0, end=5)
- graphemes
import grapheme
list(grapheme.graphemes('string'))
Quickstart
import grapheme
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308" # Rainbow flag emoji: one grapheme made of four code points (escaped so the zero width joiner is not lost)
# Correctly count graphemes
visual_length = grapheme.length(rainbow_flag)
print(f"Visual length of '{rainbow_flag}': {visual_length}") # Expected: 1
# Built-in len() counts code points, not graphemes
codepoint_length = len(rainbow_flag)
print(f"Code point length of '{rainbow_flag}': {codepoint_length}") # Expected: 4
# Safely slice by graphemes
text = "tamil நி (ni)"
sliced_by_grapheme = grapheme.slice(text, end=7)
print(f"Grapheme-sliced: '{sliced_by_grapheme}'") # Expected: 'tamil நி'
# Slicing by code points can split a grapheme in the middle
unsafely_sliced = text[:7]
print(f"Codepoint-sliced: '{unsafely_sliced}'") # Expected: 'tamil ந'
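To see where the two lengths in the Quickstart come from, the sketch below lists the code points of the rainbow flag using the standard library's `unicodedata` module. The string is written with escape sequences on the assumption that the intended emoji is the standard rainbow flag ZWJ sequence. The `grapheme.graphemes()` call at the end is guarded so the snippet still runs without the package.

```python
import unicodedata

# Assumed sequence for the rainbow flag: white flag, variation selector-16,
# zero width joiner, rainbow.
rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"

# One line per code point: this is what the built-in len() counts.
for ch in rainbow_flag:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

codepoint_names = [unicodedata.name(ch) for ch in rainbow_flag]

try:
    import grapheme  # third-party: pip install grapheme
except ImportError:
    grapheme = None

if grapheme is not None:
    # Each item is one user-perceived character (grapheme cluster).
    clusters = list(grapheme.graphemes(rainbow_flag))
    print(len(clusters))
```

The zero width joiner and the variation selector are invisible on their own, which is why a code point count overstates what a reader sees.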