news-please: News Crawler and Extractor

1.6.16 · active · verified Thu Apr 16

news-please is an open-source Python library for crawling news websites and extracting structured information from articles. It can recursively follow internal hyperlinks and read RSS feeds to fetch both recent and archived articles. It also provides an API for use within Python applications and can extract articles from the commoncrawl.org news archive. The current version is 1.6.16, and the project is actively maintained with regular releases.

Quickstart

This quickstart demonstrates how to extract structured information from a single news article URL using the `NewsPlease.from_url()` method. It prints the article's title, authors, publication date, and an excerpt of the main text.

from newsplease import NewsPlease

url = 'https://www.theguardian.com/world/2023/jan/01/ukraine-war-russia-new-year-attacks'
article = NewsPlease.from_url(url)

if article:
    # Extracted fields may be missing, so fall back to safe defaults.
    print(f"Title: {article.title}")
    print(f"Authors: {', '.join(article.authors or [])}")
    print(f"Publish Date: {article.date_publish}")
    print(f"Main Text (excerpt): {(article.maintext or '')[:200]}...")
else:
    print(f"Failed to extract article from {url}")
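
Extraction does not always succeed for every field, so the attributes used in the quickstart may be empty or `None` for some pages. The following is a minimal sketch of a defensive formatting helper, not part of news-please itself: `summarize_article` and its `excerpt_len` parameter are hypothetical names, and the `SimpleNamespace` object is a made-up stand-in for the article returned by `NewsPlease.from_url()` so that the example runs without network access. It assumes only the four attributes shown in the quickstart (`title`, `authors`, `date_publish`, `maintext`).

```python
from datetime import datetime
from types import SimpleNamespace


def summarize_article(article, excerpt_len=200):
    """Return display-ready fields for an article, tolerating missing values.

    Hypothetical helper: it only assumes the attributes used in the quickstart.
    """
    if article is None:
        return None

    text = getattr(article, "maintext", None) or ""
    # Append an ellipsis only when the text was actually truncated.
    excerpt = text[:excerpt_len] + ("..." if len(text) > excerpt_len else "")

    authors = getattr(article, "authors", None) or []
    date = getattr(article, "date_publish", None)
    if isinstance(date, datetime):
        published = date.isoformat()
    else:
        # Accept a preformatted string as well; fall back when nothing is set.
        published = str(date) if date else "(unknown)"

    return {
        "title": getattr(article, "title", None) or "(no title)",
        "authors": ", ".join(authors) or "(unknown)",
        "published": published,
        "excerpt": excerpt,
    }


# Stand-in for an extracted article; all values here are invented for illustration.
fake_article = SimpleNamespace(
    title="Example headline",
    authors=[],
    date_publish=datetime(2023, 1, 1, 12, 0),
    maintext=None,
)

summary = summarize_article(fake_article)
for key, value in summary.items():
    print(f"{key}: {value}")
```

In real use, the stand-in would be replaced by the object returned from `NewsPlease.from_url(url)`; the helper returns `None` when no article was extracted, which mirrors the `else` branch of the quickstart.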
