uniseg

0.10.1 verified Fri May 01 auth: no python

The uniseg library determines Unicode text segmentation boundaries (grapheme clusters, words, sentences, and line break opportunities), following Unicode Standard Annex #29 (Unicode Text Segmentation) and Unicode Standard Annex #14 (Unicode Line Breaking Algorithm). The current version is 0.10.1. It requires Python >=3.9 and has no fixed release cadence.

pip install uniseg
error TypeError: 'generator' object is not subscriptable
cause Indexing or slicing the iterator returned by a segmentation function. Generators do not support subscripting.
fix
Use list() to convert: list(grapheme_clusters(text))[0]
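error AttributeError: module 'uniseg' has no attribute 'grapheme_clusters'
cause Accessing a segmentation function on the top-level package (uniseg.grapheme_clusters). The functions are defined in submodules, and older releases may use different names or import paths.
fix
Import from the defining submodule: from uniseg.graphemecluster import grapheme_clusters. If the import still fails, upgrade to 0.10.1 with pip install --upgrade uniseg.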
error UnicodeEncodeError: 'charmap' codec can't encode character
cause Printing Unicode characters when the output stream uses a non-UTF-8 encoding, such as a legacy Windows code page (the 'charmap' codec).
fix
Set the environment variable PYTHONIOENCODING=utf-8 before starting Python, or encode the output manually.
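
The encoding fix above can also be applied from inside a script. The following is a minimal sketch, assuming Python 3.7+ (where sys.stdout.reconfigure is available) and that stdout is an ordinary text stream. The helper name force_utf8_stdout is hypothetical and not part of uniseg.

```python
import sys


def force_utf8_stdout() -> str:
    """Switch stdout to UTF-8 where possible and return the resulting encoding name."""
    # reconfigure() exists on io.TextIOWrapper (Python 3.7+). Replaced streams
    # (some IDEs and test runners) may not provide it, so look it up first.
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is not None:
        # errors="replace" substitutes "?" for anything that still cannot be
        # encoded instead of raising UnicodeEncodeError.
        reconfigure(encoding="utf-8", errors="replace")
    return getattr(sys.stdout, "encoding", None) or "unknown"


encoding = force_utf8_stdout()
print("stdout encoding:", encoding)
print("Hello World! \U0001F30D")
```

Calling the helper once at program start is usually enough. Setting PYTHONIOENCODING in the environment remains the less invasive option when the script itself should not be modified.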
gotcha Functions return iterators, not lists. Call list() to inspect or store.
fix Wrap calls in list() if you need to index or reuse results.
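gotcha Grapheme cluster and word segmentation depend on the Unicode version of the data bundled with the library, which may differ from the version used by Python's own unicodedata module. Do not assume results from the two sources agree.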
fix Check uniseg.UNICODE_VERSION for the Unicode version used.
gotcha Word segmentation (words in uniseg.wordbreak) yields every segment between word boundaries as a string, including punctuation and spaces. Do not assume it returns only words.
fix Filter results if only alphanumeric words are needed.
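
The filtering step can be sketched as follows. To keep the example independent of the installed uniseg version, the segments are hard-coded in the shape the gotcha describes (words, spaces, and punctuation as separate strings) instead of being produced by the library. The helper name keep_words is hypothetical.

```python
def keep_words(segments):
    """Yield only segments that contain at least one letter or digit."""
    for segment in segments:
        # str.isalnum() is Unicode-aware, so non-ASCII letters count as well.
        # any() keeps segments such as "don't" that mix letters and punctuation.
        if any(ch.isalnum() for ch in segment):
            yield segment


# Stand-in for the output of a word segmentation call (assumed shape).
segments = ["Hello", " ", "World", "!", " ", "\U0001F30D"]
words_only = list(keep_words(segments))
print(words_only)  # ['Hello', 'World']
```

Testing for any alphanumeric character, instead of requiring the whole segment to be alphanumeric, is a design choice: it drops whitespace, punctuation, and emoji while keeping contractions and hyphenated forms intact.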

Basic usage: split text at grapheme cluster, word, sentence, and line break boundaries.

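# Segmentation functions live in submodules, one per boundary type.
from uniseg.graphemecluster import grapheme_clusters
from uniseg.wordbreak import words
from uniseg.sentencebreak import sentences
from uniseg.linebreak import line_break_units

text = "Hello World! 🌍"
# Each function returns an iterator; list() materializes it for printing.
print("Grapheme clusters:", list(grapheme_clusters(text)))
print("Words:", list(words(text)))
print("Sentences:", list(sentences(text)))
print("Line break units:", list(line_break_units(text)))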
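
Because the segmentation functions return iterators, results that need indexing or more than one pass have to be materialized first. The sketch below illustrates this with a plain generator as a stand-in. The function fake_segments is hypothetical and only mimics the iterator behavior, not uniseg's segmentation rules.

```python
from itertools import islice


def fake_segments(text):
    """Stand-in generator: yields one character at a time, like an iterator-returning API."""
    for ch in text:
        yield ch


text = "Hello"

# Indexing the generator directly raises TypeError.
try:
    fake_segments(text)[0]
except TypeError as exc:
    print("TypeError:", exc)

# Option 1: materialize once, then index or iterate as often as needed.
segments = list(fake_segments(text))
first = segments[0]
count = len(segments)

# Option 2: take only the first few items without building the full list.
head = list(islice(fake_segments(text), 2))

# An iterator is exhausted after one pass; a second pass yields nothing.
it = fake_segments(text)
first_pass = list(it)
second_pass = list(it)

print(first, count, head, second_pass)  # H 5 ['H', 'e'] []
```

Materializing with list() is the simplest choice for short strings. For long inputs where only a prefix is needed, islice avoids building the full list.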