segtok: Sentence Segmentation and Word Tokenization
Segtok is a fast, rule-based Python library for sentence segmentation and word tokenization. It is designed for orthographically correct (well-edited) text, particularly in English, German, and the Romance languages, and offers high precision and full Unicode support. The current version is 1.5.11. The library is in maintenance mode with no active development: it still works, but it has been largely superseded by `syntok` (segtok v2), which offers better performance and handles more edge cases.
Warnings
- breaking The `segtok` library is largely superseded by `syntok` (segtok v2), its direct successor. `syntok` offers better performance and fixes several tricky issues, particularly with sentence terminal markers not followed by spaces.
- gotcha On Linux systems, installing the `regex` dependency (a core requirement for `segtok`) may fail if Python development headers (`python-dev` or `python3-dev`) are not installed.
- gotcha `segtok` is specifically designed and tuned for Indo-European languages (e.g., English, German, Spanish). Its performance and correctness may degrade significantly for other language families, such as CJK languages.
- deprecated While `segtok` itself works with Python 2.7 and 3.5+, its recommended successor, `syntok`, requires Python 3.6 or newer due to its reliance on the `typing` module.
Install
pip install segtok
Imports
- split_multi
from segtok.segmenter import split_multi
- web_tokenizer
from segtok.tokenizer import web_tokenizer
- split_contractions
from segtok.tokenizer import split_contractions
- word_tokenizer
from segtok.tokenizer import word_tokenizer
Quickstart
from segtok.segmenter import split_multi
from segtok.tokenizer import web_tokenizer, split_contractions

text = "Hello, Mr. Man. He smiled!! This, i.e. that, is it. Don't worry."

# split_multi returns a generator, so materialize it into a list
# before iterating over it more than once.
sentences = list(split_multi(text))

all_tokens = []
for sentence in sentences:
    tokens = list(split_contractions(web_tokenizer(sentence)))
    all_tokens.append(tokens)

print("Original Text:", text)

print("\nSentences:")
for s in sentences:
    print(f"- {s}")

print("\nTokens per sentence:")
for i, tokens in enumerate(all_tokens):
    print(f"Sentence {i+1}: {tokens}")