{"id":6739,"library":"newspaper4k","title":"Newspaper4k","description":"Newspaper4k is an open-source Python library for simplified article discovery and extraction from news websites. It is an actively maintained fork of the 'newspaper3k' project, offering new features, bug fixes, and improved parsing performance. The current version is 0.9.5, with frequent updates to enhance language support, address compatibility issues, and improve article content extraction.","status":"active","version":"0.9.5","language":"en","source_language":"en","source_url":"https://github.com/AndyTheFactory/newspaper4k","tags":["web scraping","article extraction","NLP","news parsing","data extraction","content curation"],"install":[{"cmd":"pip install newspaper4k","lang":"bash","label":"Install base package"},{"cmd":"pip install newspaper4k[all]","lang":"bash","label":"Install with all optional dependencies"},{"cmd":"pip install newspaper4k[gnews,cloudflare,zh]","lang":"bash","label":"Install with specific optional dependencies (e.g., Google News, Cloudflare, Chinese language support)"}],"dependencies":[{"reason":"HTML parsing and navigation","package":"beautifulsoup4"},{"reason":"Image processing for top image extraction","package":"Pillow"},{"reason":"Configuration handling","package":"PyYAML"},{"reason":"Efficient XML/HTML parsing and cleaning","package":"lxml[html_clean]"},{"reason":"Natural Language Toolkit for text processing (optional, for nlp() functionality)","package":"nltk","optional":true},{"reason":"HTTP requests for downloading article content","package":"requests"},{"reason":"RSS/Atom feed parsing","package":"feedparser"},{"reason":"Extracting top-level domain from URLs","package":"tldextract"},{"reason":"Robust date/time parsing","package":"python-dateutil"},{"reason":"Typing support","package":"typing-extensions"},{"reason":"Brotli compression support","package":"brotli"},{"reason":"Bypassing Cloudflare protection 
(optional)","package":"cloudscraper","optional":true},{"reason":"Google News integration (optional)","package":"gnews","optional":true},{"reason":"Chinese language support (optional)","package":"jieba","optional":true},{"reason":"Thai language support (optional)","package":"pythainlp","optional":true},{"reason":"Japanese language support (optional)","package":"tinysegmenter","optional":true},{"reason":"Bengali, Hindi, Nepali, Tamil language support (optional)","package":"indic-nlp-library","optional":true},{"reason":"robots.txt enforcement (optional)","package":"protego","optional":true}],"imports":[{"note":"The 'newspaper.article()' helper function streamlines downloading and parsing, combining the Article object instantiation, download(), and parse() calls into one step. The direct Article class import and method calls still work but are more verbose for a single article.","wrong":"from newspaper import Article; Article(url).download().parse()","symbol":"article","correct":"import newspaper\narticle = newspaper.article(url)"},{"note":"Used for building a 'Source' object to crawl an entire news website.","symbol":"build","correct":"import newspaper\nsource = newspaper.build(url)"}],"quickstart":{"code":"import newspaper\n\n# Example for a single article\nurl = \"https://edition.cnn.com/2023/11/08/china/china-blizzard-disruption-intl-hnk/index.html\"\narticle = newspaper.article(url)\n\nprint(f\"Title: {article.title}\")\nprint(f\"Authors: {article.authors}\")\nprint(f\"Publish Date: {article.publish_date}\")\nprint(f\"Top Image: {article.top_image}\")\n\n# Run NLP to extract keywords and a summary (requires the optional NLTK dependency)\narticle.nlp()\nprint(f\"Summary: {article.summary}\")\nprint(f\"Keywords: {article.keywords}\")\n\n# Example for processing a news source (website)\n# cnn_paper = newspaper.build('http://cnn.com')\n# for article_obj in cnn_paper.articles:\n#    print(article_obj.url)\n#    article_obj.download()\n#    
article_obj.parse()\n#    print(article_obj.title)","lang":"python","description":"This quickstart demonstrates how to extract key information from a single news article using the `newspaper.article()` helper. It retrieves the title, authors, publish date, top image, and then performs NLP to get a summary and keywords. A commented-out example shows how to initialize and crawl an entire news source using `newspaper.build()` and iterate through its articles."},"warnings":[{"fix":"Upgrade your Python environment to 3.10 or newer.","message":"Newspaper4k requires Python 3.10 or higher. Older Python versions (3.8 and 3.9) are no longer officially supported as of version 0.9.4, though they might still function. Ensure your Python environment meets this requirement.","severity":"breaking","affected_versions":">=0.9.4"},{"fix":"Be prepared for potential breakages and monitor for updates to the library. Consider alternative methods if reliability is critical.","message":"The Google News integration (`GoogleNewsSource`) can be unstable. Google frequently changes its HTML structure and URL encoding, which may cause this functionality to break without notice. This requires the `gnews` optional dependency.","severity":"gotcha","affected_versions":"All"},{"fix":"Verify NLP output for non-Western languages and consider external NLP libraries for more robust analysis if needed.","message":"The `article.nlp()` method, which extracts keywords and summaries, currently works most reliably on Western languages. Its performance and accuracy might be limited for non-Western languages, even with language-specific optional dependencies installed.","severity":"gotcha","affected_versions":"All"},{"fix":"Implement delays between requests, use proxy rotation, and rotate user-agent strings. Install `protego` optional dependency for robots.txt enforcement. 
For heavily protected sites (e.g., Cloudflare), `cloudscraper` (optional dependency) or external tools like Playwright might be necessary.","message":"Aggressively downloading many articles from a single source using multi-threading or rapid requests can lead to rate limiting, IP blocks, or CAPTCHA challenges from websites. Always respect each site's `robots.txt`.","severity":"gotcha","affected_versions":"All"},{"fix":"Ensure `article.download()` and `article.parse()` are called, in that order, before accessing article properties or running NLP.","message":"When using the `Article` class directly (not `newspaper.article()`), you must explicitly call `article.download()` and `article.parse()` before attempting to access most article attributes (like `title`, `text`, `authors`, `publish_date`) or calling `article.nlp()`. Failure to do so will result in errors or empty data.","severity":"gotcha","affected_versions":"All"},{"fix":"Remove any usage of `text_cleaned` or `clean_doc`. The primary article text is available via `article.text`.","message":"The `text_cleaned` and `clean_doc` attributes/methods have been deprecated and removed. Direct access to `article.clean_top_node` is also removed.","severity":"deprecated","affected_versions":">=0.9.2"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}