{"id":9345,"library":"sumy","title":"Sumy","description":"Sumy is an active Python library (current version 0.12.0) for automatic text summarization, supporting a variety of algorithms such as LSA, LexRank, Luhn, Edmundson, and TextRank. It provides parsers for plain text and HTML pages, and integrates with NLTK for tokenization and stemming. The project maintains a regular release cadence, primarily focusing on language support and bug fixes.","status":"active","version":"0.12.0","language":"en","source_language":"en","source_url":"https://github.com/miso-belica/sumy","tags":["nlp","text summarization","extractive summarization","data mining","natural language processing"],"install":[{"cmd":"pip install sumy","lang":"bash","label":"Install core library"},{"cmd":"pip install sumy[arabic,chinese,greek,hebrew,japanese,korean,polish,thai]","lang":"bash","label":"Install with all optional language dependencies"},{"cmd":"python -c \"import nltk; nltk.download('punkt')\"","lang":"bash","label":"Download NLTK 'punkt' tokenizer data (required)"}],"dependencies":[{"reason":"Required for default tokenizers and stemmers; specific data ('punkt') must be downloaded.","package":"nltk","optional":false},{"reason":"Required by the LSA summarizer.","package":"numpy","optional":true},{"reason":"Required for parsing content from URLs (e.g., HtmlParser).","package":"requests","optional":true},{"reason":"Required for Chinese language tokenizer.","package":"jieba","optional":true},{"reason":"Required for Korean language tokenizer.","package":"konlpy","optional":true},{"reason":"Required for Japanese language tokenizer.","package":"tinysegmenter","optional":true},{"reason":"Required for Thai language tokenizer.","package":"pythainlp","optional":true},{"reason":"Required for Hebrew language tokenizer.","package":"hebrew_tokenizer","optional":true}],"imports":[{"symbol":"PlaintextParser","correct":"from sumy.parsers.plaintext import PlaintextParser"},{"symbol":"HtmlParser","correct":"from sumy.parsers.html import HtmlParser"},{"symbol":"Tokenizer","correct":"from sumy.nlp.tokenizers import Tokenizer"},{"symbol":"Stemmer","correct":"from sumy.nlp.stemmers import Stemmer"},{"symbol":"LsaSummarizer","correct":"from sumy.summarizers.lsa import LsaSummarizer"},{"symbol":"LexRankSummarizer","correct":"from sumy.summarizers.lex_rank import LexRankSummarizer"},{"symbol":"LuhnSummarizer","correct":"from sumy.summarizers.luhn import LuhnSummarizer"},{"symbol":"TextRankSummarizer","correct":"from sumy.summarizers.text_rank import TextRankSummarizer"},{"symbol":"get_stop_words","correct":"from sumy.utils import get_stop_words"}],"quickstart":{"code":"import nltk\nfrom sumy.parsers.plaintext import PlaintextParser\nfrom sumy.nlp.tokenizers import Tokenizer\nfrom sumy.summarizers.lsa import LsaSummarizer\nfrom sumy.nlp.stemmers import Stemmer\nfrom sumy.utils import get_stop_words\n\n# Download NLTK 'punkt' data if not already present\ntry:\n    nltk.data.find('tokenizers/punkt')\nexcept LookupError:\n    nltk.download('punkt')\n\nLANGUAGE = \"english\"\nSENTENCES_COUNT = 5\n\ntext = (\n    \"Machine learning is transforming industries worldwide. \"\n    \"Companies are investing heavily in AI research and development. \"\n    \"The future of technology depends on these advancements. \"\n    \"Natural Language Processing (NLP) is a field of Artificial Intelligence \"\n    \"that focuses on the interaction between computers and humans through natural language. \"\n    \"The goal of NLP is to enable computers to understand, interpret, and generate human language \"\n    \"in a way that is both meaningful and useful. \"\n    \"Common NLP applications include language translation, sentiment analysis, \"\n    \"speech recognition, and text summarization.\"\n)\n\nparser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))\nstemmer = Stemmer(LANGUAGE)\n\nsummarizer = LsaSummarizer(stemmer)\nsummarizer.stop_words = get_stop_words(LANGUAGE)\n\nprint(f\"Original text length: {len(text.split())} words\\n\")\nprint(f\"Summary ({SENTENCES_COUNT} sentences) using LSA Summarizer:\\n\")\nfor sentence in summarizer(parser.document, SENTENCES_COUNT):\n    print(sentence)","lang":"python","description":"This quickstart demonstrates how to use Sumy to summarize a plain text document with the LSA (Latent Semantic Analysis) summarizer. It includes the necessary imports and the required NLTK 'punkt' data download, then sets up a parser, stemmer, and summarizer to extract a specified number of sentences from the input text."},"warnings":[{"fix":"Upgrade to Python 3.8+.","message":"Official support for Python 2.7 was dropped in Sumy v0.9.0.","severity":"breaking","affected_versions":">=0.9.0"},{"fix":"Run `python -c \"import nltk; nltk.download('punkt')\"` once after installing Sumy and NLTK.","message":"NLTK's 'punkt' tokenizer data is a mandatory dependency for most languages and must be downloaded separately using `nltk.download('punkt')`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If you rely on the older TextRank algorithm, switch to `ReductionSummarizer`; otherwise adapt your code to the new `TextRankSummarizer`.","message":"The `TextRankSummarizer` implementation was changed in v0.8.0 to use an iterative algorithm. The previous algorithm was renamed to `ReductionSummarizer`.","severity":"breaking","affected_versions":">=0.8.0"},{"fix":"Ensure your Python environment uses setuptools or pip for installation, or update to a modern Python version (3.8+).","message":"Support for `distutils` during installation was dropped in v0.6.0, affecting older Python environments or custom build processes.","severity":"breaking","affected_versions":">=0.6.0"},{"fix":"Install the specific language tokenizer package, e.g., `pip install sumy[chinese]` or `pip install jieba`.","message":"Certain languages (e.g., Chinese, Japanese, Korean, Hebrew, Thai) require additional Python packages for their tokenizers. These are listed as optional dependencies.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Run `python -c \"import nltk; nltk.download('punkt')\"` to download the required NLTK data files.","cause":"The NLTK 'punkt' tokenizer data, a core dependency for Sumy's tokenizers, has not been downloaded.","error":"LookupError: \n**********************************************************************\n  Resource 'tokenizers/punkt' not found.\n  Please use the NLTK Downloader to obtain the resource:\n\n  >>> import nltk\n  >>> nltk.download('punkt')\n  \n  For more information see: https://www.nltk.org/data.html\n\n  Attempted to load tokenizers/punkt/PY3/english.pickle\n  Searched in:\n    - '/home/user/nltk_data'\n    - '/usr/share/nltk_data'\n    - '/usr/local/share/nltk_data'\n    - '/usr/lib/nltk_data'\n    - '/usr/local/lib/nltk_data'\n**********************************************************************"},{"fix":"Install the missing language-specific dependency. For Chinese, run `pip install jieba`. Refer to Sumy's documentation for other languages, or use `pip install sumy[language]` where 'language' is the relevant extra.","cause":"Attempting to use a Sumy tokenizer for a language (e.g., Chinese) that requires an external, non-default Python library that is not installed.","error":"ValueError: Chinese tokenizer requires jieba. Please, install it by command 'pip install jieba'."},{"fix":"Upgrade Sumy to version 0.10.0 or newer: `pip install --upgrade sumy`. This version includes a fix for Python 3.10+ compatibility. Also ensure your Python environment is 3.8 or newer.","cause":"Sumy versions prior to 0.10.0 imported `Sequence` from `collections` rather than `collections.abc`. The `collections.Sequence` alias was deprecated since Python 3.3 and removed in Python 3.10, so the import fails on Python 3.10 and newer.","error":"ImportError: cannot import name 'Sequence' from 'collections'"}]}