Segments

2.4.0 · active · verified Sun Apr 12

Segments provides functions to tokenize and segment strings of text into individual characters or graphemes, and into segments according to orthography profiles. It is particularly useful for linguistic data processing using CLDF (Cross-Linguistic Data Formats). The library typically sees a few releases per year, with major versions introducing updates to Unicode standards.

Warnings

Install
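The package is published on PyPI under the name `segments`, so a standard pip install is sufficient:

```shell
pip install segments
```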

Imports

Quickstart

Demonstrates basic Unicode grapheme tokenization and custom segmentation using an orthography profile.

from segments import Tokenizer, Profile

# Unicode grapheme-cluster tokenization (the default when no profile is given);
# the tokenizer returns segments joined by spaces as a single string.
tokenizer = Tokenizer()
text_unicode = 'ŋ͡m'
graphemes = tokenizer(text_unicode)
print(f"Unicode graphemes for '{text_unicode}': {graphemes}")

# Segmentation using an orthography profile: each entry maps a Grapheme
# in the input to a value in the 'mapping' column (here 'th', 'ch', 'ph'
# are treated as single segments and mapped to aspirated stops).
profile = Profile(
    {'Grapheme': 'th', 'mapping': 'tʰ'},
    {'Grapheme': 'ch', 'mapping': 'cʰ'},
    {'Grapheme': 'ph', 'mapping': 'pʰ'},
    {'Grapheme': 'ai', 'mapping': 'ai'},
)
profile_tokenizer = Tokenizer(profile=profile)
text_profile = 'thaiph'
segmented = profile_tokenizer(text_profile, column='mapping')
print(f"Profile segments for '{text_profile}': {segmented}")

# Expected output for verification: the combining double inverted breve
# (U+0361) attaches to the preceding base character, so 'ŋ͡' is one cluster.
assert graphemes == 'ŋ͡ m'
assert segmented == 'tʰ ai pʰ'
