Unicode Segmentation and Width for Python using Rust

0.2.4 · active · verified Thu Apr 16

unicode-segmentation-rs provides Python bindings for the Rust `unicode-segmentation` and `unicode-width` crates. It splits Unicode text into grapheme clusters, words, and sentences, and calculates display width, following the Unicode standards. The current version is 0.2.4. The library is actively maintained, and updates often follow new Unicode releases as they are adopted by the underlying Rust crates.
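
To illustrate why grapheme segmentation matters, the sketch below uses only the Python standard library (it does not call unicode_segmentation_rs) to show that `len()` counts code points, not user-perceived characters. The family emoji used in the Quickstart is a single extended grapheme cluster, yet it is built from four emoji joined by three zero-width joiners. The variable names are illustrative only.

```python
# Standard library only; this does not call unicode_segmentation_rs.
# The family emoji: man, woman, girl, boy, joined by ZERO WIDTH JOINER (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

code_points = len(family)                  # Python strings count code points
utf8_bytes = len(family.encode("utf-8"))   # 4 emoji x 4 bytes + 3 joiners x 3 bytes

print(code_points, utf8_bytes)  # 7 25
```

A grapheme-aware segmenter is expected to treat this whole sequence as one unit, which is why slicing or counting with plain string indexing can split an emoji in the middle.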

Common errors

Warnings

Install

Imports

Quickstart

Demonstrates basic usage for grapheme, word, and sentence segmentation, as well as calculating the display width of a string.

import unicode_segmentation_rs

text = "Hello 👨‍👩‍👧‍👦 World. How are you?"

# Grapheme clusters (user-perceived characters)
graphemes = unicode_segmentation_rs.graphemes(text, is_extended=True)
print(f"Graphemes: {graphemes}")

# Unicode words (excludes punctuation and whitespace)
words = unicode_segmentation_rs.unicode_words(text)
print(f"Words: {words}")

# Sentences
sentences = unicode_segmentation_rs.unicode_sentences(text)
print(f"Sentences: {sentences}")

# Display width
width = unicode_segmentation_rs.text_width("你好, World!")
print(f"Display width of '你好, World!': {width}")
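
As a rough cross-check on what "display width" means, the following sketch approximates it with the standard library's `unicodedata` module. It is an assumption-laden approximation, not the algorithm used by `unicode-width` or by this library: the helper name `approx_width` is hypothetical, and the sketch ignores zero-width joiners, control characters, and emoji sequences, so its results are not guaranteed to match `text_width`.

```python
import unicodedata


def approx_width(text: str) -> int:
    """Approximate terminal columns (hypothetical helper, stdlib only).

    Wide (W) and fullwidth (F) characters count as 2 columns, characters
    with a nonzero canonical combining class count as 0, all others as 1.
    """
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks attach to the previous character
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


sample = "你好, World!"
print(len(sample), approx_width(sample))  # 10 12
```

The string has 10 code points, but the two CJK characters each occupy two columns, so the approximate width is 12. This gap between length and width is the reason to use a width function rather than `len()` when aligning text in a terminal.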
