{"id":4372,"library":"newspaper3k","title":"Newspaper3k","description":"Newspaper3k is a Python 3 library for article discovery, extraction, and natural language processing (NLP) on news websites. It extracts main content and metadata such as title, authors, publish date, images, and videos, and can generate keywords and summaries. Its last PyPI release was in 2018; it remains functional for many use cases, but a community fork (`newspaper4k`) provides more active development and modern features.","status":"maintenance","version":"0.2.8","language":"en","source_language":"en","source_url":"https://github.com/codelucas/newspaper/","tags":["web scraping","NLP","article extraction","news","content extraction","automation"],"install":[{"cmd":"pip install newspaper3k","lang":"bash","label":"Install core library"},{"cmd":"python -c \"import nltk; nltk.download('punkt')\"","lang":"bash","label":"Download NLTK 'punkt' tokenizer (required for NLP features)"}],"dependencies":[{"reason":"Used for Natural Language Processing features like keyword extraction and summarization, specifically requires the 'punkt' tokenizer.","package":"nltk"},{"reason":"Core dependency for efficient HTML parsing.","package":"lxml"},{"reason":"Used for parsing HTML, often internally, and can be used for custom extraction when core library fails.","package":"beautifulsoup4"},{"reason":"Handles HTTP requests for downloading web content.","package":"requests"},{"reason":"For image processing.","package":"Pillow"}],"imports":[{"note":"Despite the package name 'newspaper3k', the top-level import for classes is typically 'newspaper'.","wrong":"from newspaper3k import Article","symbol":"Article","correct":"from newspaper import Article"},{"note":"The 'build' function is accessed directly from the imported 'newspaper' module.","wrong":"import newspaper3k; newspaper3k.build(...)","symbol":"build","correct":"import newspaper; newspaper.build(...)"},{"note":"Used 
for advanced configurations like user agents, proxies, and caching.","symbol":"Config","correct":"from newspaper import Config"}],"quickstart":{"code":"import newspaper\nfrom newspaper import Article, Config\nimport os\n\nimport nltk\n\n# Configure a user agent to avoid being blocked\nconfig = Config()\nconfig.browser_user_agent = os.environ.get('USER_AGENT', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36')\nconfig.request_timeout = 10 # Set a timeout (seconds)\n\n# Ensure NLTK 'punkt' is downloaded for NLP features.\n# nltk.data.find raises LookupError when the resource is missing.\ntry:\n    nltk.data.find('tokenizers/punkt')\nexcept LookupError:\n    print(\"Downloading NLTK 'punkt' tokenizer...\")\n    nltk.download('punkt')\n    print(\"NLTK 'punkt' tokenizer downloaded.\")\n\nurl = 'https://www.reuters.com/world/europe/ukraine-braces-russian-attacks-east-civilians-flee-2022-04-08/'\narticle = Article(url, config=config)\n\narticle.download()\narticle.parse()\n\nprint(f\"Title: {article.title}\")\nprint(f\"Authors: {article.authors}\")\nprint(f\"Publish Date: {article.publish_date}\")\nprint(f\"Top Image: {article.top_image}\")\nprint(f\"\\nText (first 500 chars):\\n{article.text[:500]}...\")\n\narticle.nlp() # Run NLP for keywords and summary\nprint(f\"\\nKeywords: {article.keywords}\")\nprint(f\"Summary: {article.summary[:200]}...\")\n\n# Example for a news source\n# cnn_paper = newspaper.build('http://cnn.com', config=config)\n# print(f\"CNN has {cnn_paper.size()} articles.\")\n# for article_obj in cnn_paper.articles[:3]:\n#     print(f\"  - {article_obj.url}\")","lang":"python","description":"This quickstart demonstrates how to extract an article's content and metadata, including NLP-generated keywords and summaries. It also configures a user agent and request timeout, and downloads the NLTK 'punkt' tokenizer if it is missing, which is necessary for NLP features."},"warnings":[{"fix":"Always use `pip install newspaper3k`. 
Ensure your environment uses Python 3.","message":"The `newspaper` package is for Python 2 and is deprecated. For Python 3, you MUST install `newspaper3k`. Using `pip install newspaper` on Python 3 might lead to issues or install an old, unmaintained version.","severity":"breaking","affected_versions":"<=0.0.9 (for `newspaper`), all versions of `newspaper3k` when incorrectly installed"},{"fix":"Be aware of potential parsing failures on complex or JavaScript-heavy sites. For active development and improved parsing, evaluate the `newspaper4k` library (e.g., `pip install newspaper4k`).","message":"The `newspaper3k` library has not seen a PyPI release since 2018 (version 0.2.8). While still functional, it may struggle with modern web structures or newer Python versions. Consider `newspaper4k` (a community fork) for active development and bug fixes.","severity":"deprecated","affected_versions":"0.2.8 and older"},{"fix":"For persistent issues on specific sites, inspect the website's HTML. You may need to extract content manually, for example by parsing `article.html` with `BeautifulSoup` (imported from `bs4`), or by fetching and pre-processing the page yourself with `requests` and `BeautifulSoup`.","message":"Website HTML structures change frequently, which can break `newspaper3k`'s article extraction logic. Common issues include missing authors, incomplete text, or inability to parse specific elements.","severity":"gotcha","affected_versions":"All"},{"fix":"Run `python -c \"import nltk; nltk.download('punkt')\"` once to download the necessary data after installing the library. Ensure `nltk` is installed (it's a dependency, but the data is separate).","message":"The NLP features (like `article.nlp()` for keywords and summaries) rely on NLTK and require the `punkt` tokenizer data to be downloaded. 
Without it, you will encounter `LookupError`.","severity":"gotcha","affected_versions":"All"},{"fix":"Always configure a `Config` object with `config.browser_user_agent` set to a common browser user agent string and `config.request_timeout` to a reasonable value before using `Article` or `build`. Consider using proxies if scraping at scale.","message":"Aggressive or frequent requests without setting a proper `User-Agent` or `request_timeout` can lead to `ReadTimeout` errors or IP blocking by target websites.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}