news-please: News Crawler and Extractor
news-please is an open-source, easy-to-use Python library designed for crawling news websites and extracting structured information from articles. It can recursively follow internal hyperlinks and read RSS feeds to fetch both recent and archived articles. The library also provides an API for programmatic use within Python applications and supports extracting articles from the commoncrawl.org news archive. The project is actively maintained, with version 1.6.16 as the latest release and a regular release cadence.
Common errors
- Error: `ImportError: cannot import name 'NewsPlease' from 'newsplease'`
  Cause: Python cannot find the `NewsPlease` class within the `newsplease` package, usually because of a typo in the import statement, an improperly installed package, or a local file named `newsplease.py` shadowing the installed library.
  Fix: Ensure `news-please` is correctly installed (`pip install news-please`) and that the import statement reads `from newsplease import NewsPlease`. Check that no file named `newsplease.py` and no folder named `newsplease` in your current working directory conflicts with the installed package.
- Error: Failed to extract an article from a URL, or the article object is empty or missing expected fields.
  Cause: Web scraping is inherently fragile: website layouts change frequently, and anti-scraping measures can prevent successful extraction, so `news-please` may fail to parse an article or may return an incomplete object.
  Fix: Inspect the `article` object for `None` values or missing attributes. Try a different URL, or check whether the target website has introduced new anti-bot measures. For highly dynamic sites, additional pre-processing or custom extraction logic may be required outside of `news-please`.
- Error: `requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))` or similar network errors.
  Cause: The target server actively rejected or closed the connection, often due to aggressive request rates, a blocked User-Agent, or IP-based blocking by the website's security systems.
  Fix: Set a custom, less identifiable `USER_AGENT` in your configuration (`config.cfg`) and add delays between requests (rate limiting) to avoid overwhelming the server. If the error persists, you may be rate-limited or IP-blocked, in which case a proxy or a waiting period may be needed.
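For transient network errors like the one above, wrapping the fetch in a retry loop with backoff is a common mitigation. Below is a minimal sketch of such a helper; `fetch_with_retries` is a hypothetical name, and the commented usage with `NewsPlease.from_url` assumes `news-please` is installed and the target site is reachable.

```python
import time


def fetch_with_retries(url, fetch, retries=3, delay=2.0, backoff=2.0):
    """Call `fetch(url)`, retrying with exponential backoff on exceptions.

    `fetch` is any callable that raises on failure, e.g. a wrapper around
    NewsPlease.from_url. Assumes retries >= 1.
    """
    attempt_delay = delay
    last_exc = None
    for _ in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # e.g. requests.exceptions.ConnectionError
            last_exc = exc
            time.sleep(attempt_delay)  # back off before the next attempt
            attempt_delay *= backoff
    raise last_exc


# Usage sketch (requires news-please and network access):
# from newsplease import NewsPlease
# article = fetch_with_retries(url, NewsPlease.from_url)
```

Retries only paper over intermittent failures; if every attempt fails, the original exception is re-raised so the block still surfaces persistent rate limiting or IP blocks.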
Warnings
- gotcha Windows users may encounter issues installing dependencies such as `lxml` and `pywin32` via pip and may need to install pre-compiled wheels manually.
- gotcha Crawling with the default User-Agent string can get you blocked by many news websites.
- gotcha There's a distinction between CLI mode (for full website crawling or continuous RSS feeds) and library mode (for extracting individual URLs). Attempting full crawls directly through the library API might not yield expected results without proper setup.
- gotcha When running news-please in CLI mode, pressing `CTRL+C` multiple times to terminate the process is not recommended and can lead to data inconsistencies. It's best to allow for a graceful shutdown.
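Since two of the warnings above concern the default User-Agent and aggressive crawling, both are typically addressed in the `config.cfg` that news-please generates for CLI mode. news-please is built on Scrapy, so Scrapy settings such as `USER_AGENT` and `DOWNLOAD_DELAY` apply; the excerpt below is a hypothetical fragment, and the section and key names should be verified against the config file generated on your first run.

```ini
# Hypothetical config.cfg excerpt -- verify section/key names against
# the file news-please generates on first run.
[Scrapy]
# A descriptive custom User-Agent is less likely to be blocked than the default.
USER_AGENT = 'my-research-crawler (contact: admin@example.org)'
# Delay between requests, in seconds, to avoid overwhelming servers.
DOWNLOAD_DELAY = 2
```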
Install
- pip install news-please
Imports
- NewsPlease
from newsplease import NewsPlease
Quickstart
from newsplease import NewsPlease
url = 'https://www.theguardian.com/world/2023/jan/01/ukraine-war-russia-new-year-attacks'
article = NewsPlease.from_url(url)
if article:
    print(f"Title: {article.title}")
    print(f"Authors: {', '.join(article.authors)}")
    print(f"Publish Date: {article.date_publish}")
    print(f"Main Text (excerpt): {article.maintext[:200]}...")
else:
    print(f"Failed to extract article from {url}")
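As noted under Common errors, extraction can succeed while still leaving individual fields as `None`. A small helper makes that easy to check before using the result; `missing_fields` is a hypothetical name, and the field list mirrors the attributes used in the quickstart above.

```python
# Attributes the quickstart relies on; extend as needed.
EXPECTED_FIELDS = ("title", "authors", "date_publish", "maintext")


def missing_fields(article, fields=EXPECTED_FIELDS):
    """Return the names of expected attributes that are None or absent."""
    return [f for f in fields if getattr(article, f, None) is None]


# Usage sketch (requires news-please and network access):
# from newsplease import NewsPlease
# article = NewsPlease.from_url(url)
# if article is None or missing_fields(article):
#     print("Incomplete extraction:", None if article is None else missing_fields(article))
```

Checking for missing fields up front avoids errors such as slicing `article.maintext` when it is `None`.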