Sumy

0.12.0 · active · verified Thu Apr 16

Sumy is an active Python library (current version 0.12.0) for automatic text summarization, supporting a variety of algorithms such as LSA, LexRank, Luhn, Edmundson, and TextRank. It provides utilities for parsing plain text and HTML pages, and integrates with NLTK for tokenization and stemming. The project maintains a regular release cadence, focused primarily on language support and bug fixes.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to use Sumy to summarize a plain-text document with the LSA (Latent Semantic Analysis) summarizer. It includes the necessary imports and the required NLTK 'punkt' data download, then sets up a parser, stemmer, and summarizer to extract a specified number of sentences from the input text.

import nltk
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words

# Download NLTK 'punkt' data if not already present
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    nltk.download('punkt')

LANGUAGE = "english"
SENTENCES_COUNT = 5

text = (
    "Machine learning is transforming industries worldwide. "
    "Companies are investing heavily in AI research and development. "
    "The future of technology depends on these advancements. "
    "Natural Language Processing (NLP) is a field of Artificial Intelligence "
    "that focuses on the interaction between computers and humans through natural language. "
    "The goal of NLP is to enable computers to understand, interpret, and generate human language "
    "in a way that is both meaningful and useful. "
    "Common NLP applications include language translation, sentiment analysis, "
    "speech recognition, and text summarization."
)

parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
stemmer = Stemmer(LANGUAGE)

summarizer = LsaSummarizer(stemmer)
summarizer.stop_words = get_stop_words(LANGUAGE)

print(f"Original text length: {len(text.split())} words\n")
print(f"Summary ({SENTENCES_COUNT} sentences) using LSA Summarizer:\n")
for sentence in summarizer(parser.document, SENTENCES_COUNT):
    print(sentence)
