{"id":8349,"library":"news-please","title":"news-please: News Crawler and Extractor","description":"news-please is an open-source, easy-to-use Python library designed for crawling news websites and extracting structured information from articles. It can recursively follow internal hyperlinks and read RSS feeds to fetch both recent and archived articles. The library also provides an API for programmatic use within Python applications and supports extracting articles from the commoncrawl.org news archive. It is currently active, with version 1.6.16 released, and maintains a regular release cadence.","status":"active","version":"1.6.16","language":"en","source_language":"en","source_url":"https://github.com/fhamborg/news-please","tags":["news","crawler","scraper","information extraction","web scraping","article extraction"],"install":[{"cmd":"pip install news-please","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core web crawling framework.","package":"Scrapy","optional":false},{"reason":"Article extraction and content analysis.","package":"Newspaper3k","optional":false},{"reason":"For parsing HTML and XML documents, with specific installation considerations on Windows.","package":"lxml","optional":false},{"reason":"For downloading web pages.","package":"requests","optional":false},{"reason":"For extracting top-level domain information.","package":"tldextract","optional":false},{"reason":"Optional database backend for results.","package":"PyMySQL","optional":true},{"reason":"Optional database backend for results.","package":"psycopg2-binary","optional":true},{"reason":"Optional backend for storing and versioning extracted data.","package":"elasticsearch","optional":true}],"imports":[{"symbol":"NewsPlease","correct":"from newsplease import NewsPlease"}],"quickstart":{"code":"from newsplease import NewsPlease\n\nurl = 'https://www.theguardian.com/world/2023/jan/01/ukraine-war-russia-new-year-attacks'\narticle = 
NewsPlease.from_url(url)\n\nif article:\n    print(f\"Title: {article.title}\")\n    print(f\"Authors: {', '.join(article.authors)}\")\n    print(f\"Publish Date: {article.date_publish}\")\n    # maintext may be None if no body text was extracted, so fall back to an empty string\n    print(f\"Main Text (excerpt): {(article.maintext or '')[:200]}...\")\nelse:\n    print(f\"Failed to extract article from {url}\")","lang":"python","description":"This quickstart extracts structured information from a single news article URL using the `NewsPlease.from_url()` method. It prints the article's title, authors, publication date, and an excerpt of the main text."},"warnings":[{"fix":"For 'lxml', download a compatible wheel from Christoph Gohlke's Python page (unofficial but common source) and `pip install` it. For 'pywin32', download and run its installer.","message":"Windows users may encounter issues installing direct dependencies like 'lxml' and 'pywin32' via pip, requiring manual installation of pre-compiled wheels.","severity":"gotcha","affected_versions":"All versions on Windows"},{"fix":"Configure a custom, less generic `USER_AGENT` in the `config.cfg` file (default location `~/news-please/config`) to improve crawl success rates. For example, `USER_AGENT = 'news-please (+http://www.example.com)'`.","message":"Many news websites block crawlers that send the default User-Agent string, particularly when crawling aggressively.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For comprehensive website crawling or RSS-based continuous crawling, use the command-line interface. For individual article extraction from known URLs, use the library functions like `NewsPlease.from_url()` or `NewsPlease.from_html()`.","message":"There's a distinction between CLI mode (for full website crawling or continuous RSS feeds) and library mode (for extracting individual URLs). 
Attempting full crawls directly through the library API might not yield expected results without proper setup.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Press `CTRL+C` once and wait for news-please to shut down gracefully (typically 5-60 seconds). Only press `CTRL+C` twice for an immediate, forceful kill if absolutely necessary.","message":"When running news-please in CLI mode, pressing `CTRL+C` multiple times to terminate the process is not recommended and can lead to data inconsistencies. It's best to allow for a graceful shutdown.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Ensure `news-please` is correctly installed (`pip install news-please`). Verify your import statement is `from newsplease import NewsPlease`. Check that there isn't a Python file named `newsplease.py` or a folder named `newsplease` in your current working directory that might conflict with the installed package.","cause":"This typically occurs if Python cannot find the `NewsPlease` class within the `newsplease` package. Common reasons include a typo in the import statement, an improperly installed package, or a local file named `newsplease.py` shadowing the installed library.","error":"ImportError: cannot import name 'NewsPlease' from 'newsplease'"},{"fix":"Inspect the `article` object for `None` values or missing attributes. Consider trying a different URL or checking if the target website has implemented new anti-bot measures. For very dynamic sites, additional pre-processing or custom extraction logic might be required outside of `news-please`.","cause":"Web scraping is inherently fragile. Website layouts change frequently, and anti-scraping measures can prevent successful extraction. 
As a result, `news-please` may fail to parse an article or may return an incomplete object.","error":"Failed to extract article from URL / Article object is empty or missing expected fields."},{"fix":"Set a custom, less generic `USER_AGENT` in your configuration (`config.cfg`). Consider adding delays between requests (rate limiting) to avoid overwhelming the server. If the error persists, you may be rate-limited or IP-blocked; wait before retrying or route requests through a proxy.","cause":"This error often indicates that the target server actively rejected or closed the connection. Likely causes are aggressive request rates, a blocked User-Agent, or IP-based blocking by the website's security systems.","error":"requests.exceptions.ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response')) or similar network errors."}]}