{"id":4206,"library":"pysbd","title":"pysbd (Python Sentence Boundary Disambiguation)","description":"pysbd (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection library that works out-of-the-box across many languages. It aims to provide accurate sentence segmentation even with complex text, abbreviations, and varied punctuation, offering an alternative to neural network-based approaches. The current version is 0.3.4, and the project appears to be actively maintained.","status":"active","version":"0.3.4","language":"en","source_language":"en","source_url":"https://github.com/nipunsadvilkar/pySBD","tags":["nlp","sentence segmentation","text processing","rule-based","multilingual"],"install":[{"cmd":"pip install pysbd","lang":"bash","label":"PyPI"},{"cmd":"conda install anaconda::pysbd","lang":"bash","label":"Conda"}],"dependencies":[{"reason":"Optional, for integration as a spaCy pipeline component.","package":"spacy","optional":true}],"imports":[{"note":"Primary class for sentence segmentation","symbol":"Segmenter","correct":"import pysbd; segmenter = pysbd.Segmenter(...)"},{"note":"PySBDFactory is specifically for spaCy integration and resides in pysbd.utils, not directly under pysbd or pysbd.segmenter.","wrong":"from pysbd.segmenter import PySBDFactory","symbol":"PySBDFactory","correct":"from pysbd.utils import PySBDFactory"}],"quickstart":{"code":"import pysbd\n\ntext = \"Dr. Smith went to the U.S. last week. He said, 'Hello!' How are you?\"\n\n# Initialize segmenter for English\nsegmenter = pysbd.Segmenter(language=\"en\", clean=False)\n\nsentences = segmenter.segment(text)\n\nfor i, sent in enumerate(sentences):\n    print(f\"Sentence {i+1}: {sent}\")","lang":"python","description":"This example demonstrates basic sentence segmentation using the `Segmenter` class for English text. 
Passing `clean=False` (the constructor default) leaves the input text unmodified before segmentation."},"warnings":[{"fix":"For spaCy v3.x, register a custom factory function with the `@Language.factory('pysbd_segmenter')` decorator (the name is your choice), then add it by name with `nlp.add_pipe('pysbd_segmenter')`. Refer to spaCy's v3 documentation on custom pipeline components.","message":"spaCy v2.x examples that call `nlp.add_pipe(PySBDFactory(nlp))` are not compatible with spaCy v3.x. In v3, `add_pipe` expects the string name of a registered component factory rather than a component object.","severity":"breaking","affected_versions":"spaCy v3.x and later"},{"fix":"If identical segmentation across integration methods is critical, test both approaches on your own text data. Direct usage (`pysbd.Segmenter().segment(text)`) generally aligns with the expected rule-based output.","message":"Segmentation results might differ slightly between using `pysbd.Segmenter` directly and using `pysbd` as a spaCy pipeline component, especially with quoted text or complex punctuation.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If you need to preserve the original text, leave `clean` at its default of `False`. Implement custom cleaning steps if specific pre-processing is required.","message":"Passing `clean=True` to the `Segmenter` constructor performs aggressive pre-filtering of the input text, removing repeated punctuation, line breaks, URLs, and HTML tags. This might alter the original text more than desired for certain NLP tasks.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Initialize the segmenter as `segmenter = pysbd.Segmenter(language='en', char_span=True)`. The output will then be a list of `TextSpan` objects, each carrying the sentence text (`sent`) and its `start` and `end` character indices.","message":"By default, `pysbd.Segmenter` returns a list of strings. 
If you require character offsets into the original text for non-destructive tokenization, you must initialize `Segmenter` with `char_span=True`. This option cannot be combined with `clean=True`, because cleaning modifies the original text; the constructor raises a `ValueError` in that case.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For very large corpora or real-time applications, benchmark `pysbd` against other tools before committing to it. Where accuracy matters more than throughput, `pysbd` remains a strong choice.","message":"`pysbd` is a rule-based system implemented in pure Python. It may be slower than sentence boundary detectors implemented in lower-level languages such as C++ or optimized with Cython.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}