{"id":8752,"library":"unicode-segmentation-rs","title":"Unicode Segmentation and Width for Python using Rust","description":"unicode-segmentation-rs provides Python bindings for the Rust `unicode-segmentation` and `unicode-width` crates, offering robust Unicode text segmentation (grapheme clusters, words, sentences) and display width calculation according to Unicode standards. It is currently at version 0.2.4 and is an actively maintained library, with updates often tied to new Unicode standard releases in its underlying Rust dependencies.","status":"active","version":"0.2.4","language":"en","source_language":"en","source_url":"https://github.com/WeblateOrg/unicode-segmentation-rs","tags":["unicode","segmentation","graphemes","words","sentences","text processing","rust","bindings"],"install":[{"cmd":"pip install unicode-segmentation-rs","lang":"bash","label":"Install from PyPI"}],"dependencies":[],"imports":[{"symbol":"graphemes","correct":"from unicode_segmentation_rs import graphemes"},{"symbol":"unicode_words","correct":"from unicode_segmentation_rs import unicode_words"},{"symbol":"unicode_sentences","correct":"from unicode_segmentation_rs import unicode_sentences"},{"symbol":"text_width","correct":"from unicode_segmentation_rs import text_width"},{"symbol":"gettext_wrap","correct":"from unicode_segmentation_rs import gettext_wrap"}],"quickstart":{"code":"import unicode_segmentation_rs\n\ntext = \"Hello 👨‍👩‍👧‍👦 World. 
How are you?\"\n\n# Grapheme clusters (user-perceived characters)\ngraphemes = unicode_segmentation_rs.graphemes(text, is_extended=True)\nprint(f\"Graphemes: {graphemes}\")\n\n# Unicode words (excludes punctuation and whitespace)\nwords = unicode_segmentation_rs.unicode_words(text)\nprint(f\"Words: {words}\")\n\n# Sentences\nsentences = unicode_segmentation_rs.unicode_sentences(text)\nprint(f\"Sentences: {sentences}\")\n\n# Display width\nwidth = unicode_segmentation_rs.text_width(\"你好, World!\")\nprint(f\"Display width of '你好, World!': {width}\")","lang":"python","description":"Demonstrates basic usage for grapheme, word, and sentence segmentation, as well as calculating the display width of a string."},"warnings":[{"fix":"Always pass `is_extended=True` to `graphemes()` and `grapheme_indices()` unless you specifically require legacy grapheme clustering behavior.","message":"When using `graphemes()` or `grapheme_indices()`, it is highly recommended to set `is_extended=True`. This follows the definition of 'extended grapheme clusters' in Unicode Standard Annex #29, which represent user-perceived characters. Failing to do so can lead to non-intuitive or incorrect segmentation for complex Unicode sequences, such as emojis or combining characters.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Iterate over the results of segmentation functions (e.g., `for grapheme in unicode_segmentation_rs.graphemes(text, is_extended=True):`) rather than attempting direct string indexing with arbitrary offsets.","message":"Directly indexing Python strings after performing Unicode segmentation (e.g., trying to access `my_string[i]` based on grapheme cluster counts) is an anti-pattern. Unicode text segmentation algorithms are inherently streaming, and direct indexing into a string by a 'grapheme index' is inefficient and often indicative of a misunderstanding of the Unicode text model. 
The library provides lists of segmented strings or indices, which should be iterated over, not used for direct string indexing.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure input strings are valid Unicode where possible. For critical applications, implement robust error handling around calls to `unicode_segmentation_rs` functions, especially with untrusted input.","message":"While the Python bindings aim for stability, the underlying Rust `unicode-segmentation` crate has historically encountered panics (e.g., 'byte index is not a char boundary' or arithmetic overflows) with highly malformed or edge-case Unicode input, particularly with its lower-level cursor APIs. Although the Python layer should convert Rust panics into Python exceptions, unexpected input could still lead to issues or crashes in rare circumstances.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"`len(unicode_segmentation_rs.graphemes(my_string, is_extended=True))` will provide the count of user-perceived characters.","cause":"Python's built-in `len()` counts Unicode code points, not user-perceived characters (grapheme clusters). This is a common misconception when dealing with complex scripts or emojis.","error":"len(my_string) returning an incorrect 'character count'"},{"fix":"Always ensure that string inputs are correctly decoded from bytes if they originate from external sources (e.g., `my_bytes.decode('utf-8', errors='replace')`). The Python bindings should generally prevent direct exposure of raw byte-level issues unless the string itself is fundamentally corrupted.","cause":"The underlying Rust library expects valid UTF-8 for its string operations. 
While Python strings are inherently Unicode-aware, a string built from improperly decoded bytes (for example, one containing lone surrogates produced by `errors='surrogateescape'`) cannot be represented as valid UTF-8 and may fail when passed to the Rust layer.","error":"Crash or unexpected behavior when processing malformed UTF-8 input (e.g., `ValueError: byte index is not a char boundary`)"}]}