Sentence Stream
Sentence Stream is a small, pure-Python library for splitting text into sentences. It is designed for text streams such as large files or network streams: it processes text incrementally, without loading the entire content into memory. The current version is 1.3.0, and the project is actively maintained, with regular releases for improvements and bug fixes.
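To make "processing text incrementally" concrete, here is a minimal sketch of the general technique: buffer incoming chunks, emit every sentence that is certainly complete, and keep the unfinished tail for the next chunk. The function name `iter_sentences` and the boundary rule (`.`, `!`, or `?` followed by whitespace) are assumptions made for illustration; this is not the library's implementation or its rule set.

```python
import re
from typing import Iterable, Iterator

# Naive boundary rule (an assumption for this sketch): sentence-ending
# punctuation followed by at least one whitespace character.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def iter_sentences(chunks: Iterable[str]) -> Iterator[str]:
    """Yield sentences from an iterable of text chunks, one at a time."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _BOUNDARY.split(buffer)
        # The last part may be an unfinished sentence: keep it buffered.
        buffer = parts.pop()
        for part in parts:
            if part:
                yield part
    # End of input: whatever is left is the final sentence.
    tail = buffer.strip()
    if tail:
        yield tail


# Sentences may be cut anywhere by chunk boundaries.
for sentence in iter_sentences(["Hello wor", "ld. This is", " a test. Last one"]):
    print(repr(sentence))
```

Only the unfinished tail is held in memory between chunks, which is why this style of processing suits large files and network streams.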
Common errors
- ModuleNotFoundError: No module named 'sentence-stream'
  cause: The hyphenated distribution name was used as the module name; the importable module uses an underscore. (Written literally, `from sentence-stream import ...` is a `SyntaxError`; the `ModuleNotFoundError` appears when the hyphenated name is imported dynamically, e.g. `importlib.import_module("sentence-stream")`.)
  fix: Change the import statement to `from sentence_stream import SentenceStream`.
- TypeError: 'NoneType' object is not iterable
  cause: Passing `None` or an unexpected non-string, non-iterable object to `SentenceStream`'s constructor.
  fix: Ensure the input to `SentenceStream` is either a string or an iterable that yields strings (e.g., a file-like object).
- AttributeError: 'SentenceStream' object has no attribute 'read' (or 'split_text', etc.)
  cause: Calling a method that does not exist on the `SentenceStream` object, possibly by confusing it with a string or a file object.
  fix: The `SentenceStream` object is itself an iterable. To get sentences, iterate over the instance: `for sentence in my_stream: ...`.
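The `TypeError` above is easier to diagnose if the input is checked before it reaches the splitter. The following is a minimal sketch of such a guard; `ensure_text_source` is a hypothetical helper written for this page, not part of the library, and it assumes the input contract described above (a string or an iterable of strings).

```python
import io


def ensure_text_source(source):
    """Hypothetical pre-check: return `source` unchanged if it is usable."""
    if source is None:
        raise ValueError("expected a string or an iterable of strings, got None")
    if isinstance(source, (bytes, bytearray)):
        # Bytes are iterable but yield integers, not text.
        raise TypeError("decode bytes to str before splitting")
    if isinstance(source, str):
        return source
    try:
        iter(source)
    except TypeError:
        raise TypeError(
            f"expected a string or an iterable of strings, got {type(source).__name__}"
        ) from None
    return source


# Strings and file-like objects pass through unchanged.
checked = ensure_text_source(io.StringIO("Hello world. This is a test."))
```

Failing early with a message that names the offending type is usually clearer than the `'NoneType' object is not iterable` error raised deeper in the call stack.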
Warnings
- gotcha This library provides a rule-based sentence splitter and is not intended for advanced Natural Language Processing (NLP) tokenization that requires deep linguistic understanding or model-based analysis. It focuses on basic punctuation-driven splitting.
- gotcha `SentenceStream` accepts a single string as input, but its main benefit comes from processing real input streams (iterables that yield chunks of text). A very large single string is already fully in memory, and the library still buffers it internally before processing, so most of the memory advantage of streaming is lost.
- gotcha The library primarily uses standard English punctuation rules. While robust for many cases, it may not perfectly handle highly ambiguous punctuation, abbreviations, or specific linguistic nuances across all languages without explicit configuration or custom rules.
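One way to produce the "iterables that yield chunks of text" mentioned above is a small generator that reads a file-like object in fixed-size pieces. This is a sketch under assumptions: `read_chunks` and the default chunk size are illustrative names and values, not library API, and whether `SentenceStream` accepts such a generator directly should be checked against the installed version.

```python
import io
from typing import Iterator, TextIO


def read_chunks(stream: TextIO, size: int = 4096) -> Iterator[str]:
    """Yield successive chunks of at most `size` characters from `stream`."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            # An empty string from read() signals end of input.
            break
        yield chunk


# Simulated file; with a real file, use open(path, encoding="utf-8").
demo = io.StringIO("First sentence. Second sentence. Third one.")
chunks = list(read_chunks(demo, size=16))
```

Because only one chunk is held at a time, peak memory stays bounded by the chunk size rather than by the size of the file.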
Install
- pip install sentence-stream
Imports
- SentenceStream
  wrong: from sentence-stream import SentenceStream
  right: from sentence_stream import SentenceStream
Quickstart
import io

from sentence_stream import SentenceStream

# Example with a simple string
text_input = "Hello world. This is a test. Another sentence.\nNew paragraph. One more?"
stream = SentenceStream(text_input)
print("--- Processing string input ---")
for sentence in stream:
    print(f"'{sentence}'")

# Example with a file-like object (simulates a stream)
# The parenthesized literals are joined first, then repeated 10 times.
long_text = (
    "This is the first sentence. And here is the second one. "
    "The third sentence continues here. Finally, a fourth. "
    "This could be a very large file. "
) * 10
file_stream = io.StringIO(long_text)
stream_from_file = SentenceStream(file_stream)
print("\n--- Processing file-like object ---")
sentences_count = 0
for sentence in stream_from_file:
    # print(f"'{sentence}'")  # Uncomment to see all sentences
    sentences_count += 1
print(f"Processed {sentences_count} sentences from stream.")