{"id":5056,"library":"segtok","title":"segtok: Sentence Segmentation and Word Tokenization","description":"Segtok is a fast, rule-based Python library for sentence segmentation and word tokenization. It is designed for orthographically well-formed text, particularly in English, German, and the Romance languages, offering high precision and Unicode support. The current version is 1.5.11. While still functional, it is largely superseded by 'syntok' (segtok v2), which offers improved performance and handles more edge cases. segtok is in a maintenance phase with no active development.","status":"maintenance","version":"1.5.11","language":"en","source_language":"en","source_url":"https://github.com/fnl/segtok","tags":["nlp","text-processing","tokenization","segmentation","linguistics"],"install":[{"cmd":"pip install segtok","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core dependency for pattern-based segmentation and tokenization. On Linux, compiling `regex` from source requires the Python development headers (python-dev/python3-dev) when no prebuilt wheel is available.","package":"regex","optional":false}],"imports":[{"symbol":"split_multi","correct":"from segtok.segmenter import split_multi"},{"symbol":"web_tokenizer","correct":"from segtok.tokenizer import web_tokenizer"},{"symbol":"split_contractions","correct":"from segtok.tokenizer import split_contractions"},{"symbol":"word_tokenizer","correct":"from segtok.tokenizer import word_tokenizer"}],"quickstart":{"code":"from segtok.segmenter import split_multi\nfrom segtok.tokenizer import web_tokenizer, split_contractions\n\ntext = \"Hello, Mr. Man. He smiled!! This, i.e. that, is it. Don't worry.\"\n# split_multi returns a generator; materialize it so it can be iterated more than once\nsentences = list(split_multi(text))\n\nall_tokens = []\nfor sentence in sentences:\n    tokens = list(split_contractions(web_tokenizer(sentence)))\n    all_tokens.append(tokens)\n\nprint(\"Original Text:\", text)\nprint(\"\\nSentences:\")\nfor s in sentences:\n    print(f\"- {s}\")\n\nprint(\"\\nTokens per sentence:\")\nfor i, tokens in enumerate(all_tokens):\n    print(f\"Sentence {i+1}: {tokens}\")","lang":"python","description":"This quickstart segments text into sentences with `split_multi`, then tokenizes each sentence with `web_tokenizer` followed by `split_contractions` for English-specific contraction handling."},"warnings":[{"fix":"Consider migrating to `syntok`. Install with `pip install syntok` and adjust imports and usage patterns; note that `syntok` supports Python 3.6+ only.","message":"The `segtok` library is largely superseded by `syntok` (segtok v2), its direct successor. `syntok` offers better performance and fixes several tricky cases, particularly sentence-terminal markers not followed by spaces.","severity":"breaking","affected_versions":"All segtok versions"},{"fix":"Install the necessary development packages before installing `segtok`: `sudo apt-get install python3-dev` (Debian/Ubuntu) or `sudo yum install python3-devel` (CentOS/RHEL).","message":"On Linux systems, installing the `regex` dependency (a core requirement of `segtok`) may fail if the Python development headers (`python-dev` or `python3-dev`) are not installed and no prebuilt wheel is available.","severity":"gotcha","affected_versions":"All versions on Linux"},{"fix":"For non-Indo-European languages, evaluate alternatives designed for those languages' characteristics; do not assume `segtok` will perform well out of the box.","message":"`segtok` is designed and tuned for Indo-European languages (e.g., English, German, Spanish). Its accuracy may degrade significantly for other language families, such as the CJK languages.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If you plan to upgrade to `syntok`, ensure your project runs on Python 3.6 or newer.","message":"While `segtok` itself supports Python 2.7 and 3.5+, its recommended successor, `syntok`, requires Python 3.6 or newer.","severity":"deprecated","affected_versions":"<=1.5.11"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}