Newspaper3k

0.2.8 · maintenance · verified Sun Apr 12

Newspaper3k is a Python 3 library for article discovery, extraction, and natural language processing (NLP) of news websites. It extracts the main content and metadata (title, authors, publish date, images, videos) and can generate keywords and summaries. Although its last PyPI release was in 2018, it remains functional for many use cases; a community fork (`newspaper4k`) provides more active development and modern features.

Warnings

- The last PyPI release (0.2.8) dates from 2018 and the project is in maintenance mode; consider the `newspaper4k` fork if you need active development.
- NLP features (`article.nlp()`) require the NLTK 'punkt' tokenizer; download it once before use (see Quickstart).

Install
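A minimal install (assuming `pip` targets a Python 3 environment; on some systems the command is `pip3`):

```shell
# Install the library from PyPI
pip install newspaper3k

# One-time download of the NLTK 'punkt' tokenizer, required by article.nlp()
python -c "import nltk; nltk.download('punkt')"
```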

Imports

The core entry points are the `newspaper` package itself plus the `Article` and `Config` classes, as used in the Quickstart below.

Quickstart

This quickstart demonstrates how to extract an article's content and metadata, including NLP-generated keywords and a summary. It also configures a custom user agent and downloads the NLTK 'punkt' tokenizer, which the NLP features require.

import newspaper
from newspaper import Article, Config
import os

# Configure a user agent to avoid being blocked
config = Config()
config.browser_user_agent = os.environ.get('USER_AGENT', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36')
config.request_timeout = 10 # Set a timeout

# Ensure the NLTK 'punkt' tokenizer is available (required by article.nlp())
import nltk
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:  # nltk.data.find raises LookupError when the resource is missing
    print("Downloading NLTK 'punkt' tokenizer...")
    nltk.download('punkt')

url = 'https://www.reuters.com/world/europe/ukraine-braces-russian-attacks-east-civilians-flee-2022-04-08/'
article = Article(url, config=config)

article.download()
article.parse()

print(f"Title: {article.title}")
print(f"Authors: {article.authors}")
print(f"Publish Date: {article.publish_date}")
print(f"Top Image: {article.top_image}")
print(f"\nText (first 500 chars):\n{article.text[:500]}...")

article.nlp() # Run NLP for keywords and summary
print(f"\nKeywords: {article.keywords}")
print(f"Summary: {article.summary[:200]}...")

# Example for a news source
# cnn_paper = newspaper.build('http://cnn.com', config=config)
# print(f"CNN has {cnn_paper.size()} articles.")
# for article_obj in cnn_paper.articles[:3]:
#     print(f"  - {article_obj.url}")
