Newspaper4k
Newspaper4k is an open-source Python library for simplified article discovery and extraction from news websites. It is an actively maintained fork of the 'newspaper3k' project, offering new features, bug fixes, and improved parsing performance. The current version is 0.9.5, with frequent updates to enhance language support, address compatibility issues, and improve article content extraction.
Warnings
- breaking Newspaper4k requires Python 3.10 or higher. Older Python versions (3.8 and 3.9) are no longer officially supported as of version 0.9.4, though they might still function. Ensure your Python environment meets this requirement.
- gotcha The Google News integration (`GoogleNewsSource`) can be unstable. Google frequently changes its HTML structure and URL encoding, which may cause this functionality to break without notice. This requires the `gnews` optional dependency.
- gotcha The `article.nlp()` method, which extracts keywords and summaries, currently works most reliably on Western languages. Its performance and accuracy might be limited for non-Western languages, even with language-specific optional dependencies installed.
- gotcha Aggressively downloading many articles from a single source using multi-threading or rapid requests can lead to rate limiting, IP blocks, or CAPTCHA challenges from websites. Always respect `robots.txt` if enabled.
- gotcha When using the `Article` class directly (not `newspaper.article()`), you must explicitly call `article.download()` and `article.parse()` before attempting to access most article attributes (like `title`, `text`, `authors`, `publish_date`) or calling `article.nlp()`. Failure to do so will result in errors or empty data.
- deprecated The `text_cleaned` and `clean_doc` attributes/methods have been deprecated and removed. Direct access to `article.clean_top_node` is also removed.
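The rate-limiting warning above can be handled with a small per-host delay. The following is a minimal sketch using only the standard library; `HostThrottle` and its parameters are hypothetical names introduced for illustration and are not part of Newspaper4k.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host.

    Hypothetical helper for illustration; not part of Newspaper4k.
    """

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_seen = {}  # host -> time of the previous request

    def wait(self, url):
        """Sleep if needed so requests to url's host are spaced out.

        Returns the number of seconds slept.
        """
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last_seen.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay > 0:
            self._sleep(delay)
        # Record when this request is (about to be) sent.
        self._last_seen[host] = now + delay
        return delay
```

One way to use it would be to call `throttle.wait(article.url)` immediately before each `article.download()`, choosing `min_interval` to suit the target site.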
Install
- pip install newspaper4k
- pip install newspaper4k[all]
- pip install newspaper4k[gnews,cloudflare,zh]
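After installing, it may help to confirm that the interpreter meets the Python 3.10 requirement stated in the warnings and that the package is visible to it. The sketch below uses only the standard library; `check_environment` and `MIN_PYTHON` are hypothetical names introduced for illustration.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)  # requirement stated in the warnings above


def check_environment(package="newspaper4k"):
    """Return (python_ok, installed_version_or_None) without importing the package."""
    python_ok = sys.version_info >= MIN_PYTHON
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None
    return python_ok, installed


python_ok, installed = check_environment()
print(f"Python >= 3.10: {python_ok}; newspaper4k version: {installed}")
```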
Imports
- article
from newspaper import Article; Article(url).download().parse()
import newspaper; article = newspaper.article(url)
- build
import newspaper; source = newspaper.build(url)
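When the `Article` class is used directly, the download and parse steps have to run in order before any attribute is read. A minimal sketch of that sequence follows; `fetch_article` and its `article_cls` parameter are hypothetical, added only so the call order can be exercised without network access.

```python
def fetch_article(url, article_cls=None):
    """Download and parse one article, then return it.

    `article_cls` is a hypothetical injection point for testing; by default
    the real `newspaper.Article` class is used.
    """
    if article_cls is None:
        # Imported lazily so the helper can be exercised with a stand-in class.
        from newspaper import Article
        article_cls = Article

    article = article_cls(url)
    article.download()  # fetch the HTML
    article.parse()     # extract title, text, authors, publish_date
    return article
```

Calling `fetch_article(url)` then gives an object whose `title` and `text` are populated; `article.nlp()` can be called on it afterwards if keywords and a summary are needed.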
Quickstart
import newspaper
# Example for a single article
url = "https://edition.cnn.com/2023/11/08/china/china-blizzard-disruption-intl-hnk/index.html"
article = newspaper.article(url)
print(f"Title: {article.title}")
print(f"Authors: {article.authors}")
print(f"Publish Date: {article.publish_date}")
print(f"Top Image: {article.top_image}")
# Run NLP to extract keywords and a summary (requires NLTK, plus any language-specific NLP dependencies)
article.nlp()
print(f"Summary: {article.summary}")
print(f"Keywords: {article.keywords}")
# Example for processing a news source (website)
# cnn_paper = newspaper.build('http://cnn.com')
# for article_obj in cnn_paper.articles:
#     print(article_obj.url)
#     article_obj.download()
#     article_obj.parse()
#     print(article_obj.title)
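The source loop above can be wrapped so that it stops after a few articles and skips any that fail, which keeps the request volume low. This is a sketch under assumptions: `collect_titles` and `limit` are hypothetical names, and the broad `except Exception` is a simplification that should be narrowed in real use.

```python
def collect_titles(source, limit=5):
    """Return (url, title) pairs for up to `limit` articles of a built source."""
    results = []
    for article in source.articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose; narrow it for real use
            print(f"Skipping {article.url}: {exc}")
            continue
        results.append((article.url, article.title))
    return results
```

With a real source this would be called as `collect_titles(newspaper.build('http://cnn.com'), limit=5)`.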