segtok: Sentence Segmentation and Word Tokenization

1.5.11 · maintenance · verified Sun Apr 12

Segtok is a fast, rule-based Python library for sentence segmentation and word tokenization. It targets orthographically regular text, particularly in English, German, and the Romance languages, offering high precision and full Unicode support. The project is in a maintenance phase with no active development: it has been largely superseded by syntok ("segtok v2"), which offers improved performance and handles more edge cases.

Warnings

The segmenter functions (`split_single`, `split_multi`) return generators, not lists; materialize them with `list()` if you need to iterate over the sentences more than once. The project itself is maintenance-only, so new projects should consider syntok instead.

Install
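Segtok is published on PyPI; a standard pip install fetches the latest 1.5.x release:

```shell
pip install segtok
```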

Imports

Quickstart

This quickstart demonstrates basic sentence segmentation using `split_multi` and then tokenizes each sentence using `web_tokenizer` followed by `split_contractions` for English-specific handling.

from segtok.segmenter import split_multi
from segtok.tokenizer import web_tokenizer, split_contractions

text = "Hello, Mr. Man. He smiled!! This, i.e. that, is it. Don't worry."
# split_multi returns a generator; materialize it so it can be iterated twice
sentences = list(split_multi(text))

all_tokens = []
for sentence in sentences:
    tokens = list(split_contractions(web_tokenizer(sentence)))
    all_tokens.append(tokens)

print("Original Text:", text)
print("\nSentences:")
for s in sentences:
    print(f"- {s}")

print("\nTokens per sentence:")
for i, tokens in enumerate(all_tokens):
    print(f"Sentence {i+1}: {tokens}")
