Unicode Segmentation and Width for Python using Rust
unicode-segmentation-rs provides Python bindings for the Rust `unicode-segmentation` and `unicode-width` crates, offering robust Unicode text segmentation (grapheme clusters, words, sentences) and display width calculation according to Unicode standards. It is currently at version 0.2.4 and is actively maintained, with updates often tied to new Unicode standard releases in its underlying Rust dependencies.
Common errors
- `len(my_string)` returning an incorrect 'character count'
  - cause: Python's built-in `len()` counts Unicode code points, not user-perceived characters (grapheme clusters). This is a common misconception when dealing with complex scripts or emojis.
  - fix: `len(unicode_segmentation_rs.graphemes(my_string, is_extended=True))` will provide the count of user-perceived characters.
- Crash or unexpected behavior when processing malformed UTF-8 input (e.g., `ValueError: byte index is not a char boundary`)
  - cause: The underlying Rust library expects valid UTF-8 for its string operations. While Python strings are inherently Unicode-aware, constructing them from improperly decoded bytes or passing corrupted string data can lead to internal inconsistencies in the Rust layer.
  - fix: Always ensure that string inputs are correctly decoded from bytes if they originate from external sources (e.g., `my_bytes.decode('utf-8', errors='replace')`). The Python bindings should generally prevent direct exposure of raw byte-level issues unless the string itself is fundamentally corrupted.
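Both failure modes can be demonstrated with the standard library alone, without the bindings installed. A minimal sketch (the emoji sequence and byte string are illustrative values, not from the library):

```python
# Code points vs. user-perceived characters: a ZWJ emoji sequence renders
# as one grapheme cluster but counts as several code points to len().
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"  # woman + ZWJ + woman + ZWJ + girl
print(len(family))  # 5 code points, but a single user-perceived character

# Sanitizing external bytes before segmentation: replace invalid UTF-8
# rather than letting corrupted data reach the Rust layer.
raw = b"caf\xc3\xa9 \xff"  # trailing byte is not valid UTF-8
text = raw.decode("utf-8", errors="replace")
print(text)  # 'café \ufffd' — invalid byte replaced with U+FFFD
```

With the library installed, `len(unicode_segmentation_rs.graphemes(family, is_extended=True))` would report the user-perceived count instead.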
Warnings
- gotcha When using `graphemes()` or `grapheme_indices()`, it is highly recommended to set `is_extended=True`. This adheres to the Unicode Standard Annex #29 for 'extended grapheme clusters', which represents user-perceived characters. Failing to do so can lead to non-intuitive or incorrect segmentation for complex Unicode sequences, such as emojis or combining characters.
- gotcha Directly indexing Python strings after performing Unicode segmentation (e.g., trying to access `my_string[i]` based on grapheme cluster counts) is an anti-pattern. Unicode text segmentation algorithms are inherently streaming, and direct indexing into a string by a 'grapheme index' is inefficient and often indicative of a misunderstanding of the Unicode text model. The library provides lists of segmented strings or indices, which should be iterated over, not used for direct string indexing.
- gotcha While the Python bindings aim for stability, the underlying Rust `unicode-segmentation` crate has historically encountered panics (e.g., 'byte index is not a char boundary' or arithmetic overflows) with highly malformed or edge-case Unicode input, particularly with its lower-level cursor APIs. Although the Python layer should convert Rust panics into Python exceptions, unexpected input could still lead to issues or crashes in rare circumstances.
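The indexing anti-pattern can be seen without the library at all: a 'grapheme index' applied directly to a Python string lands inside multi-code-point clusters. A sketch using only built-in string indexing (the example string is illustrative):

```python
# "X" is the second grapheme cluster here, but text[1] is not "X":
text = "\U0001F469\u200D\U0001F467X"  # woman + ZWJ + girl, then "X"
print(text[1] == "X")       # False — text[1] is the ZWJ (U+200D)
print(text[1] == "\u200D")  # True
print(text[3])              # "X" — the cluster actually starts at code point 3
# Iterate over the list returned by graphemes(text, is_extended=True)
# instead of indexing the string by cluster position.
```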
Install
-
pip install unicode-segmentation-rs
Imports
- graphemes
from unicode_segmentation_rs import graphemes
- unicode_words
from unicode_segmentation_rs import unicode_words
- unicode_sentences
from unicode_segmentation_rs import unicode_sentences
- text_width
from unicode_segmentation_rs import text_width
- gettext_wrap
from unicode_segmentation_rs import gettext_wrap
Quickstart
import unicode_segmentation_rs
text = "Hello 👨‍👩‍👧‍👦 World. How are you?"
# Grapheme clusters (user-perceived characters)
graphemes = unicode_segmentation_rs.graphemes(text, is_extended=True)
print(f"Graphemes: {graphemes}")
# Unicode words (excludes punctuation and whitespace)
words = unicode_segmentation_rs.unicode_words(text)
print(f"Words: {words}")
# Sentences
sentences = unicode_segmentation_rs.unicode_sentences(text)
print(f"Sentences: {sentences}")
# Display width
width = unicode_segmentation_rs.text_width("你好, World!")
print(f"Display width of '你好, World!': {width}")
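The width calculation matters because East Asian wide characters occupy two terminal columns. The stdlib exposes the underlying Unicode property that the `unicode-width` crate consults; this is an illustrative sketch of the concept, not the library's implementation:

```python
import unicodedata

# '你' carries the East_Asian_Width property 'W' (Wide): two columns.
# ASCII letters are 'Na' (Narrow): one column each.
print(unicodedata.east_asian_width("你"))  # 'W'
print(unicodedata.east_asian_width("W"))   # 'Na'
# So "你好, World!" spans 2 + 2 + 1 + 1 + 5 + 1 = 12 columns
# even though it is only 10 code points long.
```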