Newspaper3k
Newspaper3k is a Python 3 library for article discovery, extraction, and basic natural language processing (NLP) on news websites. It extracts an article's main text and metadata (title, authors, publish date, images, and videos) and can generate keywords and summaries. Its last PyPI release was in 2018; it still works for many use cases, but a community fork (`newspaper4k`) offers more active development and modern features.
Warnings
- breaking The `newspaper` package is for Python 2 and is deprecated. For Python 3, you MUST install `newspaper3k`. Using `pip install newspaper` on Python 3 might lead to issues or install an old, unmaintained version.
- deprecated The `newspaper3k` library has not seen a PyPI release since 2018 (version 0.2.8). While still functional, it may struggle with modern web structures or newer Python versions. Consider `newspaper4k` (a community fork) for active development and bug fixes.
- gotcha Website HTML structures change frequently, which can break `newspaper3k`'s article extraction logic. Common issues include missing authors, incomplete text, or inability to parse specific elements.
- gotcha The NLP features (like `article.nlp()` for keywords and summaries) rely on NLTK and require the `punkt` tokenizer data to be downloaded. Without it, you will encounter `LookupError`.
- gotcha Aggressive or frequent requests without setting a proper `User-Agent` or `request_timeout` can lead to `ReadTimeout` errors or IP blocking by target websites.
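Because extraction can fail silently when a site changes its markup, it may help to sanity-check a parsed article before using it. The following is a minimal sketch, not part of `newspaper3k`: the helper name `extraction_problems` and the 200-character threshold are assumptions chosen for illustration. It only reads the `title`, `authors`, `publish_date`, and `text` attributes that a parsed `Article` exposes, so a stand-in object is used here to keep the example runnable offline.

```python
from types import SimpleNamespace


def extraction_problems(article, min_text_chars=200):
    """Return a list of human-readable problems found in a parsed article.

    `article` is any object with title, authors, publish_date and text
    attributes (for example a newspaper Article after parse()).
    `min_text_chars` is an arbitrary threshold; tune it for your sources.
    """
    problems = []
    if not (getattr(article, "title", "") or "").strip():
        problems.append("missing title")
    if not getattr(article, "authors", None):
        problems.append("missing authors")
    if getattr(article, "publish_date", None) is None:
        problems.append("missing publish date")
    text = getattr(article, "text", "") or ""
    if len(text.strip()) < min_text_chars:
        problems.append(f"text shorter than {min_text_chars} characters")
    return problems


# Stand-in object for demonstration only; not a real newspaper Article.
sample = SimpleNamespace(
    title="Example headline", authors=[], publish_date=None, text="Too short."
)
print(extraction_problems(sample))
```

An empty list means none of these checks fired; it does not guarantee the extracted text is complete or correct.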
Install
- pip install newspaper3k
- python -c "import nltk; nltk.download('punkt')"
Imports
- Article
from newspaper import Article
- build
import newspaper; newspaper.build(...)
- Config
from newspaper import Config
Quickstart
import newspaper
from newspaper import Article, Config
import os
# Configure a user agent to avoid being blocked
config = Config()
config.browser_user_agent = os.environ.get('USER_AGENT', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36')
config.request_timeout = 10 # Set a timeout
# Ensure NLTK 'punkt' is downloaded for NLP features
import nltk
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:  # nltk.data.find raises LookupError when the resource is missing
    print("Downloading NLTK 'punkt' tokenizer...")
    nltk.download('punkt')
    print("NLTK 'punkt' tokenizer downloaded.")
url = 'https://www.reuters.com/world/europe/ukraine-braces-russian-attacks-east-civilians-flee-2022-04-08/'
article = Article(url, config=config)
article.download()
article.parse()
print(f"Title: {article.title}")
print(f"Authors: {article.authors}")
print(f"Publish Date: {article.publish_date}")
print(f"Top Image: {article.top_image}")
print(f"\nText (first 500 chars):\n{article.text[:500]}...")
article.nlp() # Run NLP for keywords and summary
print(f"\nKeywords: {article.keywords}")
print(f"Summary: {article.summary[:200]}...")
# Example for a news source
# cnn_paper = newspaper.build('http://cnn.com', config=config)
# print(f"CNN has {cnn_paper.size()} articles.")
# for article_obj in cnn_paper.articles[:3]:
#     print(f" - {article_obj.url}")
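When processing several URLs (for example, the articles of a built source), pausing between requests and retrying transient failures can reduce the risk of timeouts and blocking. The sketch below is one possible approach, not a `newspaper3k` feature: `process_politely`, its parameters, and the backoff schedule are hypothetical names and values chosen for illustration. The per-URL work is passed in as a callable so the helper stays independent of the library and can be exercised without network access.

```python
import time


def process_politely(urls, process, delay_seconds=2.0, max_attempts=3,
                     sleep=time.sleep):
    """Call process(url) for each URL with pauses and simple retries.

    Returns (results, failures): results maps url -> return value of
    process(url); failures maps url -> the last exception raised.
    """
    results, failures = {}, {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed pause between different URLs
        for attempt in range(1, max_attempts + 1):
            try:
                results[url] = process(url)
                break
            except Exception as exc:  # deliberately broad for a sketch
                if attempt == max_attempts:
                    failures[url] = exc  # give up on this URL
                else:
                    # Exponential backoff before the next attempt
                    sleep(delay_seconds * 2 ** attempt)
    return results, failures


# Hypothetical usage with newspaper3k (needs network access, so it is
# left commented out). It assumes `Article` and `config` from the
# Quickstart, and that parse() raises when the download did not succeed.
# def fetch_title(url):
#     article = Article(url, config=config)
#     article.download()
#     article.parse()
#     return article.title
#
# results, failures = process_politely(list_of_urls, fetch_title)
```

Injecting `sleep` as a parameter is a design choice that makes the delays easy to replace in tests; in normal use the default `time.sleep` applies.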