uniseg
0.10.1 verified Fri May 01 auth: no python
The uniseg library finds Unicode text segmentation boundaries: grapheme clusters, words, and sentences per Unicode Standard Annex #29 (UAX #29), and line break opportunities per UAX #14. The current version is 0.10.1; it requires Python >=3.9 and is released on no fixed cadence.
pip install uniseg
Common errors
error TypeError: 'generator' object is not subscriptable
cause Indexing or slicing the generator returned by a segmentation function.
fix Convert the result to a list first: list(grapheme_clusters(text))[0]
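The fix above can be sketched without uniseg installed, because the error comes from how Python generators behave rather than from the library. In this minimal sketch `fake_clusters` is a hypothetical stand-in for a segmentation call; it is not part of uniseg.

```python
from itertools import islice


def fake_clusters(text):
    """Hypothetical stand-in for a segmentation call: yields one item at a time."""
    yield from text


segments = fake_clusters("Hello")

# segments[0] would raise TypeError: 'generator' object is not subscriptable.
first = next(segments)                 # take only the first item
next_two = list(islice(segments, 2))   # take the following two items lazily

# When random access is needed, materialize the whole result once.
as_list = list(fake_clusters("Hello"))
third = as_list[2]

print(first, next_two, third)
```

`next()` and `itertools.islice` avoid building the full list when only the first few segments are needed; `list()` is the simpler choice for short strings.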
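error AttributeError: module 'uniseg' has no attribute 'grapheme_clusters'
cause The installed uniseg version predates this name, or the import path differs from the one used.
fix Upgrade with pip install --upgrade uniseg (current release: 0.10.1) and check that the import is spelled exactly: from uniseg import grapheme_clusters. If the top-level name is still missing, the upstream documentation also lists the function under its submodule: from uniseg.graphemecluster import grapheme_clusters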
error UnicodeEncodeError: 'charmap' codec can't encode character
cause Printing characters that the output stream's encoding cannot represent, typically a Windows console using a legacy code page instead of UTF-8.
fix Set the environment variable PYTHONIOENCODING=utf-8 before starting Python, or encode the output explicitly.
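One way to "encode the output explicitly" is sketched below, using only the standard library. The helper name `printable` is hypothetical; it escapes any character the target encoding cannot represent instead of raising.

```python
import sys


def printable(text, encoding=None):
    """Return text with characters the encoding cannot represent escaped.

    Falls back to the current stdout encoding, then to UTF-8.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Alternative (Python 3.7+), when stdout is a regular text stream:
#   sys.stdout.reconfigure(encoding="utf-8")

print(printable("Hello World! 🌍"))
```

On a UTF-8 terminal the string prints unchanged; on a legacy code page the emoji is shown as a backslash escape rather than crashing the program.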
Warnings
gotcha Segmentation functions return iterators, not lists. An iterator can be consumed only once and cannot be indexed.
fix Wrap the call in list() if you need to index, inspect, or reuse the results.
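The reuse problem is easy to miss because a second pass over a consumed iterator fails silently. A minimal sketch, with `fake_words` as a hypothetical stand-in for a segmentation call (not a uniseg function):

```python
def fake_words(text):
    """Hypothetical stand-in for a segmentation call: yields pieces lazily."""
    yield from text.split(" ")


it = fake_words("one two three")
first_pass = list(it)
second_pass = list(it)   # the iterator is already exhausted, so this is empty

# Store the results once if they are needed more than once.
stored = tuple(fake_words("one two three"))

print(first_pass, second_pass, len(stored))
```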
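gotcha Grapheme cluster and word segmentation follow the Unicode version bundled with the library, which may differ from the Unicode data in your Python build (unicodedata.unidata_version) or in other tools. Avoid mixing results computed under different Unicode versions.
fix Check uniseg.UNICODE_VERSION for the Unicode version used.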
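A quick way to compare the two versions is sketched below. `unicodedata.unidata_version` is standard library; the `UNICODE_VERSION` attribute name is taken from the text above and has not been verified here, so the sketch reads it defensively and also tolerates uniseg not being installed. The function name `unicode_versions` is hypothetical.

```python
import importlib
import unicodedata


def unicode_versions():
    """Collect the Unicode versions known to this Python and, if available, uniseg."""
    versions = {"unicodedata": unicodedata.unidata_version}
    try:
        uniseg = importlib.import_module("uniseg")
    except ImportError:
        return versions  # uniseg is not installed; report Python's version only
    # Assumption: attribute name as given in this page; fall back if it is absent.
    versions["uniseg"] = getattr(uniseg, "UNICODE_VERSION", "unknown")
    return versions


print(unicode_versions())
```

If the two values differ, boundary results for recently added characters (new emoji sequences in particular) may not agree with other Unicode-aware code in the same program.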
gotcha The word_segment function yields every segment between word boundaries as a string, including punctuation and whitespace. Do not assume it returns only words.
fix Filter the results if only alphanumeric words are needed.
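The filtering step might look like the sketch below. The input list is hand-written sample data shaped like word-boundary output, not actual uniseg output, and `only_words` is a hypothetical helper name.

```python
def only_words(segments):
    """Keep segments that contain at least one letter or digit."""
    return [s for s in segments if any(ch.isalnum() for ch in s)]


# Hand-written sample: words mixed with whitespace, punctuation, and an emoji.
sample = ["Hello", " ", "World", "!", " ", "🌍"]

print(only_words(sample))
```

Testing for any alphanumeric character keeps segments such as "don't" or "3.14" while dropping pure whitespace, punctuation, and symbols; tighten the predicate if emoji or numbers should be treated differently.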
Imports
- grapheme_clusters
  from uniseg import grapheme_clusters
- word_segment
  from uniseg import word_segment
- sentences
  from uniseg import sentences
- line_break
  from uniseg import line_break
- GraphemeCluster
  from uniseg import GraphemeCluster
Quickstart
from uniseg import grapheme_clusters, word_segment, sentences, line_break
text = "Hello World! 🌍"
print("Grapheme clusters:", list(grapheme_clusters(text)))
print("Words:", list(word_segment(text)))
print("Sentences:", list(sentences(text)))
print("Line breaks:", list(line_break(text)))