Sumy
Sumy is an actively maintained Python library (current version 0.12.0) for automatic text summarization, supporting a variety of algorithms such as LSA, LexRank, Luhn, Edmundson, and TextRank. It provides parsers for plain text and HTML pages, and integrates with NLTK for tokenization and stemming. The project maintains a regular release cadence, primarily focused on language support and bug fixes.
Common errors
-
LookupError: Resource 'tokenizers/punkt' not found. Please use the NLTK Downloader to obtain the resource: >>> import nltk >>> nltk.download('punkt'). For more information see: https://www.nltk.org/data.html. Attempted to load tokenizers/punkt/PY3/english.pickle. Searched in: '/home/user/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data'
cause: The NLTK 'punkt' tokenizer data, a core dependency for Sumy's tokenizers, has not been downloaded.
fix: Run `python -c "import nltk; nltk.download('punkt')"` to download the required NLTK data files.
-
ValueError: Chinese tokenizer requires jieba. Please, install it by command 'pip install jieba'.
cause: Attempting to use a Sumy tokenizer for a language (e.g., Chinese) that requires an external, non-default Python library, and that library is not installed.
fix: Install the missing language-specific dependency. For Chinese, run `pip install jieba`. Refer to Sumy's documentation for other languages, or use `pip install sumy[language]` where 'language' is the relevant extra.
-
ImportError: cannot import name 'Sequence' from 'collections' (raised from the line `from collections import Sequence` in older Sumy code)
cause: Sumy versions prior to 0.10.0 imported `Sequence` directly from `collections` rather than `collections.abc`. That alias was deprecated in Python 3.3 and removed in Python 3.10, so the import fails on Python 3.10 and newer.
fix: Upgrade Sumy to version 0.10.0 or newer: `pip install --upgrade sumy`. This version includes a fix for Python 3.10+ compatibility. Also ensure your Python environment is 3.8 or newer.
Warnings
- breaking Official support for Python 2.7 was dropped in Sumy v0.9.0.
- gotcha NLTK's 'punkt' tokenizer data is a mandatory dependency for most languages and must be downloaded separately using `nltk.download('punkt')`.
- breaking The `TextRankSummarizer` implementation was changed in v0.8.0 to use an iterative algorithm. The previous algorithm was renamed to `ReductionSummarizer`.
- breaking Support for `distutils` during installation was dropped in v0.6.0, affecting older Python environments or custom build processes.
- gotcha Certain languages (e.g., Chinese, Japanese, Korean, Hebrew, Thai) require additional Python packages for their tokenizers. These are listed as optional dependencies.
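The optional-dependency gotcha above can be caught up front instead of at tokenization time. A minimal sketch: the `chinese` → `jieba` mapping is confirmed by the error message in this document, while the helper name and the idea of extending the mapping are illustrative; consult Sumy's packaging extras for the authoritative package list.

```python
import importlib.util

# Known from the ValueError above: Chinese requires jieba. Other languages
# (Japanese, Korean, Hebrew, Thai, ...) have their own optional packages;
# extend this mapping from Sumy's declared extras.
OPTIONAL_DEPS = {"chinese": "jieba"}

def missing_language_dep(language):
    """Return the name of a required-but-missing optional package, or None."""
    pkg = OPTIONAL_DEPS.get(language.lower())
    if pkg and importlib.util.find_spec(pkg) is None:
        return pkg
    return None
```

Calling `missing_language_dep("english")` returns `None`, since English needs no optional tokenizer package; a non-`None` result tells you which package to `pip install` before constructing the Tokenizer.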
Install
- `pip install sumy`
- `pip install "sumy[arabic,chinese,greek,hebrew,japanese,korean,polish,thai]"` (quote the extras so the shell does not glob the brackets)
- `python -c "import nltk; nltk.download('punkt')"`
Imports
- PlaintextParser
from sumy.parsers.plaintext import PlaintextParser
- HtmlParser
from sumy.parsers.html import HtmlParser
- Tokenizer
from sumy.nlp.tokenizers import Tokenizer
- Stemmer
from sumy.nlp.stemmers import Stemmer
- LsaSummarizer
from sumy.summarizers.lsa import LsaSummarizer
- LexRankSummarizer
from sumy.summarizers.lex_rank import LexRankSummarizer
- LuhnSummarizer
from sumy.summarizers.luhn import LuhnSummarizer
- TextRankSummarizer
from sumy.summarizers.text_rank import TextRankSummarizer
- get_stop_words
from sumy.utils import get_stop_words
Quickstart
import nltk
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
# Download NLTK 'punkt' data if not already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')
LANGUAGE = "english"
SENTENCES_COUNT = 5
text = (
    "Machine learning is transforming industries worldwide. "
    "Companies are investing heavily in AI research and development. "
    "The future of technology depends on these advancements. "
    "Natural Language Processing (NLP) is a field of Artificial Intelligence "
    "that focuses on the interaction between computers and humans through natural language. "
    "The goal of NLP is to enable computers to understand, interpret, and generate human language "
    "in a way that is both meaningful and useful. "
    "Common NLP applications include language translation, sentiment analysis, "
    "speech recognition, and text summarization."
)
parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
stemmer = Stemmer(LANGUAGE)
summarizer = LsaSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)
print(f"Original text length: {len(text.split())} words\n")
print(f"Summary ({SENTENCES_COUNT} sentences) using LSA Summarizer:\n")
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)