Segments
Segments tokenizes strings of text into individual characters or grapheme clusters, and segments them according to orthography profiles. It is particularly useful for linguistic data processing with CLDF (Cross-Linguistic Data Formats). The library typically sees a few releases per year, with major versions tracking updates to the Unicode standard.
Warnings
- gotcha Distinguish carefully between tokenizing with and without a profile. A bare `Tokenizer()` performs Unicode grapheme cluster segmentation only; `Tokenizer(profile=...)` applies the custom orthography rules defined in that profile.
- breaking Version 2.0.0 updated the Unicode-based tokenization, which may subtly change grapheme cluster output for certain complex character sequences compared to pre-2.0.0 versions. This is a behavioral change driven by newer Unicode data, not an API break.
- gotcha The `profile` argument of `segments.Tokenizer` expects a `segments.Profile` instance (or the path of a profile file). A `Profile` is built from dicts that each contain at least a 'Grapheme' key, plus optional extra columns such as 'mapping'; passing non-conforming dictionaries will fail.
Install
-
pip install segments
Imports
- Tokenizer
from segments import Tokenizer
- Profile
from segments import Profile
Quickstart
from segments import Profile, Tokenizer

# Unicode grapheme cluster tokenization (no profile);
# the result is a space-separated string of graphemes.
tokenizer = Tokenizer()
print(tokenizer('abcd'))
# 'a b c d'

# Tokenization with an orthography profile: each entry is a dict
# with a 'Grapheme' key plus optional extra columns such as 'mapping'
# (example profile treating 'th', 'ph', 'ai' as single segments).
profile = Profile(
    {'Grapheme': 'th', 'mapping': 'tʰ'},
    {'Grapheme': 'ph', 'mapping': 'pʰ'},
    {'Grapheme': 'ai', 'mapping': 'ai'},
)
tokenizer = Tokenizer(profile=profile)
print(tokenizer('thaiph'))
# 'th ai ph'
print(tokenizer('thaiph', column='mapping'))
# 'tʰ ai pʰ'