Newspaper4k

0.9.5 · active · verified Wed Apr 15

Newspaper4k is an open-source Python library for discovering and extracting articles from news websites. It is an actively maintained fork of the newspaper3k project that adds new features, fixes bugs, and improves parsing performance. The current version is 0.9.5; updates are frequent and focus on language support, compatibility fixes, and better article content extraction.

Warnings

Install

Imports

Quickstart

This quickstart extracts key information from a single news article using the `newspaper.article()` helper, which downloads and parses the page in one call. It prints the title, authors, publish date, and top image, then runs NLP to produce a summary and keywords. A commented-out example shows how to build a whole news source with `newspaper.build()` and iterate through its articles, downloading and parsing each one explicitly.

import newspaper

# Example for a single article
url = "https://edition.cnn.com/2023/11/08/china/china-blizzard-disruption-intl-hnk/index.html"
article = newspaper.article(url)

print(f"Title: {article.title}")
print(f"Authors: {article.authors}")
print(f"Publish Date: {article.publish_date}")
print(f"Top Image: {article.top_image}")

# Perform NLP for keywords and summary (requires the optional NLP dependencies, such as NLTK, to be installed)
article.nlp()
print(f"Summary: {article.summary}")
print(f"Keywords: {article.keywords}")

# Example for processing a news source (website)
# cnn_paper = newspaper.build('http://cnn.com')
# for article_obj in cnn_paper.articles:
#    print(article_obj.url)
#    article_obj.download()
#    article_obj.parse()
#    print(article_obj.title)
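
The commented-out source example above can be turned into a small reusable function. The following is a minimal sketch, not part of the library: `article_to_record` and `crawl_source` are hypothetical helper names, the `limit` parameter is an assumption added to keep a crawl short, and the broad `except` is a deliberate simplification. Only `newspaper.build()`, `articles`, `download()`, `parse()`, and the article attributes already shown in the quickstart are used.

```python
from typing import Any, Dict, List


def article_to_record(article: Any) -> Dict[str, Any]:
    """Collect the quickstart fields from a parsed article into a plain dict.

    Hypothetical helper: it only reads attributes, so it works on any
    object that exposes the same names as a parsed article.
    """
    publish_date = getattr(article, "publish_date", None)
    return {
        "url": getattr(article, "url", None),
        "title": getattr(article, "title", None),
        "authors": list(getattr(article, "authors", None) or []),
        # publish_date can be None when the page exposes no date
        "publish_date": (
            publish_date.isoformat()
            if hasattr(publish_date, "isoformat")
            else publish_date
        ),
        "top_image": getattr(article, "top_image", None),
    }


def crawl_source(source_url: str, limit: int = 5) -> List[Dict[str, Any]]:
    """Download and parse up to `limit` articles from a news source.

    Hypothetical helper; `limit` is an assumption, not a library option.
    """
    import newspaper  # imported lazily so article_to_record works without it

    paper = newspaper.build(source_url)
    records = []
    for article in paper.articles[:limit]:
        try:
            article.download()
            article.parse()
        except Exception as exc:  # broad on purpose: skip any article that fails
            print(f"Skipping {article.url}: {exc}")
            continue
        records.append(article_to_record(article))
    return records
```

Calling `crawl_source("http://cnn.com", limit=3)` would perform network requests, so results depend on the live site. Returning plain dicts keeps the output easy to serialize or load into another tool.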
