{"id":5055,"library":"segments","title":"Segments","description":"Segments provides functionality to tokenize strings of text into individual characters or Unicode grapheme clusters, and to segment them according to orthography profiles. It is particularly useful for linguistic data processing with CLDF (Cross-Linguistic Data Formats). The library typically sees a few releases per year, with major versions tracking updates to the Unicode Standard.","status":"active","version":"2.4.0","language":"en","source_language":"en","source_url":"https://github.com/cldf/segments","tags":["text processing","linguistics","unicode","segmentation","tokenization","graphemes","cldf"],"install":[{"cmd":"pip install segments","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Unicode-aware regular expressions used for grapheme matching.","package":"regex","optional":false},{"reason":"Utility functions and path handling shared across CLLD projects.","package":"clldutils","optional":false},{"reason":"Reading and writing orthography profiles as CSVW-described tabular data.","package":"csvw","optional":false}],"imports":[{"symbol":"Tokenizer","correct":"from segments import Tokenizer"},{"symbol":"Profile","correct":"from segments import Profile"}],"quickstart":{"code":"from segments import Tokenizer, Profile\n\n# Default tokenization: Unicode grapheme clusters, space-separated\nt = Tokenizer()\nprint(t('abcd'))  # 'a b c d'\n\n# Tokenization with an orthography profile: each specification dict\n# maps a grapheme to optional extra columns such as 'mapping'\nprf = Profile(\n    {'Grapheme': 'ab', 'mapping': 'x'},\n    {'Grapheme': 'cd', 'mapping': 'y'},\n)\nt = Tokenizer(profile=prf)\nprint(t('abcd'))                    # 'ab cd'\nprint(t('abcd', column='mapping'))  # 'x y'\n\n# Expected output for verification\nassert Tokenizer()('abcd') == 'a b c d'\nassert t('abcd', column='mapping') == 'x y'","lang":"python","description":"Demonstrates default Unicode grapheme tokenization and custom segmentation using an orthography profile."},"warnings":[{"fix":"For rule-based segmentation, construct the tokenizer as `Tokenizer(profile=...)` with a `segments.Profile`. Use a profile-less `Tokenizer()` only for standard Unicode grapheme clustering.","message":"Distinguish carefully between a `Tokenizer` created with and without a `profile`. A bare `Tokenizer()` performs Unicode grapheme cluster segmentation only; custom orthography rules are applied only when a `Profile` is passed at construction time.","severity":"gotcha","affected_versions":"All versions"},{"fix":"When upgrading from versions <2.0.0, review tokenization output for critical linguistic data to ensure consistency, and adjust downstream processing if grapheme clustering assumptions have changed under the updated Unicode interpretation.","message":"Version 2.0.0 introduced Unicode Standard tokenization, which may subtly change grapheme cluster output for certain complex character sequences compared to pre-2.0.0 versions. This is a behavioral change driven by updated Unicode standards, not an API break.","severity":"breaking","affected_versions":">=2.0.0"},{"fix":"Construct a `segments.Profile` from grapheme specification dicts (each with a 'Grapheme' key, e.g. `Profile({'Grapheme': 'aa', 'mapping': 'x'})`), or load a profile file with `Profile.from_file(path)`, and pass the resulting object to `Tokenizer(profile=...)`.","message":"The `profile` argument of `segments.Tokenizer` expects a `segments.Profile` object, not an ad-hoc dictionary of rules. Each profile row must at minimum define a 'Grapheme' column; non-conforming inputs will fail.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}