pysbd (Python Sentence Boundary Disambiguation)
pysbd (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection library that works out-of-the-box across many languages. It aims to provide accurate sentence segmentation even with complex text, abbreviations, and varied punctuation, offering an alternative to neural network-based approaches. The current version is 0.3.4, and the project appears to be actively maintained.
Warnings
- breaking: When integrating with spaCy, examples written for spaCy v2.x that call `nlp.add_pipe(PySBDFactory(nlp))` are not compatible with spaCy v3.x. In v3.x, `add_pipe` expects the string name of a registered component factory rather than a component object.
- gotcha: Segmentation results might differ slightly between using `pysbd.Segmenter` directly and using `pysbd` as a spaCy pipeline component, especially with quoted text or complex punctuation.
- gotcha: The `clean=True` parameter in the `Segmenter` constructor performs aggressive pre-filtering of the input text, removing repeated punctuation, line breaks, URLs, and HTML tags. This might alter the original text more than desired for certain NLP tasks.
- gotcha: By default, `pysbd.Segmenter` returns a list of strings. If you need character offsets into the original text for non-destructive tokenization, initialize `Segmenter` with `char_span=True`.
- gotcha: While highly accurate, `pysbd` is a rule-based system implemented in Python. It may be slower than sentence boundary detection alternatives implemented in lower-level languages such as C++ or optimized with Cython.
Install
- pip install pysbd
- conda install anaconda::pysbd
Imports
- Segmenter
import pysbd; segmenter = pysbd.Segmenter(...)
- PySBDFactory
from pysbd.utils import PySBDFactory
Quickstart
import pysbd
text = "Dr. Smith went to the U.S. last week. He said, 'Hello!' How are you?"
# Initialize segmenter for English
segmenter = pysbd.Segmenter(language="en", clean=False)
sentences = segmenter.segment(text)
for i, sent in enumerate(sentences):
    print(f"Sentence {i+1}: {sent}")