ReadabiliPy
ReadabiliPy is a Python library that wraps Mozilla's Readability.js, a tool for extracting the main content from HTML pages. It also includes pure Python article extraction routines. The library augments the Readability.js output with plain text representations of the article's paragraphs. The current version is 0.3.0, and the project is under active development with periodic releases.
Warnings
- gotcha Mozilla's Readability.js runs under Node.js, so Node.js (version 14 or higher) must be installed and available on your system's PATH. If `use_readability=True` is specified and Node.js cannot be found, `readabilipy` does not raise an error. It falls back to its pure Python extraction routines, and the only signal is a warning printed to stderr, which is easy to miss.
- gotcha The `use_readability` flag of `simple_json_from_html_string` (default `False`) selects the extractor: `True` uses the Node.js-based Readability.js wrapper, `False` uses the pure Python extractor. The two can produce significantly different results for the same article.
- breaking Prior to v0.3.0, users frequently encountered `UnicodeEncodeError` and `UnicodeDecodeError` when processing certain HTML content due to encoding issues with external Node.js subprocess calls and file handling.
- gotcha Versions prior to v0.3.0 had a bug related to changes in the working directory during article extraction, potentially leading to incorrect file paths or failures when using the Readability.js wrapper.
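Because the fallback to the pure Python extractor is easy to overlook, it can help to check for Node.js explicitly before choosing the extractor. The following is a minimal sketch, not part of ReadabiliPy: `node_available` and `extract_article` are hypothetical helper names, and the `which` parameter exists only so the lookup can be swapped out in tests. It checks that a `node` executable is on PATH but does not verify the Node.js version.

```python
import shutil


def node_available(which=shutil.which):
    """Return True if a `node` executable can be found on PATH.

    Hypothetical helper: it only checks presence, not the Node.js version.
    """
    return which("node") is not None


def extract_article(html):
    """Use Readability.js when Node.js is present, else the pure Python parser."""
    # Imported lazily so this sketch can be loaded without readabilipy installed.
    from readabilipy import simple_json_from_html_string

    use_js = node_available()
    if not use_js:
        print("Node.js not found on PATH; using the pure Python extractor")
    return simple_json_from_html_string(html, use_readability=use_js)
```

Making the choice explicit means the calling code knows which extractor produced a given result, which matters when the two outputs are compared or cached.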
Install
pip install readabilipy
Imports
- simple_json_from_html_string
from readabilipy import simple_json_from_html_string
Quickstart
import requests
from readabilipy import simple_json_from_html_string

# Fetch a page to extract; use a small inline document if the request fails
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
try:
    req = requests.get(url, timeout=10)
    req.raise_for_status()  # Raise an exception for HTTP errors
    html_content = req.text
except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")
    html_content = "<html><body><h1>Example Article</h1><p>This is a paragraph.</p></body></html>"


def first_snippet(article, length=100):
    # plain_text is a list of entries; each entry is expected to be a dict
    # with a "text" key, but plain strings are tolerated as well
    entries = article.get("plain_text") or []
    if not entries:
        return ""
    first = entries[0]
    text = first.get("text") if isinstance(first, dict) else first
    return (text or "")[:length]


# Extract with Readability.js (requires Node.js on PATH).
# If Node.js is not found, the library falls back to the pure Python parser.
article_js = simple_json_from_html_string(html_content, use_readability=True)
print("--- Extracted with Readability.js (or Python fallback) ---")
print(f"Title: {article_js.get('title')}")
print(f"Content snippet: {first_snippet(article_js)}...")

# Extract with the pure Python implementation
article_py = simple_json_from_html_string(html_content, use_readability=False)
print("\n--- Extracted with Pure Python ---")
print(f"Title: {article_py.get('title')}")
print(f"Content snippet: {first_snippet(article_py)}...")
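Since the two extractors can return noticeably different results, a rough similarity score can show how far apart they are for a given page. This is a sketch under stated assumptions: `paragraph_texts` and `similarity` are hypothetical helpers, each `plain_text` entry is assumed to be a dict with a `text` key (plain strings are also accepted), and the two dicts at the bottom are hand-written stand-ins rather than real extractor output.

```python
from difflib import SequenceMatcher


def paragraph_texts(article):
    """Collect the non-empty paragraph strings from an article dict."""
    texts = []
    for entry in article.get("plain_text") or []:
        # Assumption: entries are dicts with a "text" key; tolerate bare strings
        text = entry.get("text") if isinstance(entry, dict) else entry
        if text:
            texts.append(text)
    return texts


def similarity(article_a, article_b):
    """Return a 0..1 ratio comparing the joined paragraph text of two results."""
    a = "\n".join(paragraph_texts(article_a))
    b = "\n".join(paragraph_texts(article_b))
    return SequenceMatcher(None, a, b).ratio()


# Hand-written stand-ins for the two extraction results (not real output)
js_like = {"plain_text": [{"text": "This is a paragraph."}]}
py_like = {"plain_text": [{"text": "Example Article"}, {"text": "This is a paragraph."}]}
print(f"similarity: {similarity(js_like, py_like):.2f}")
```

In practice the two arguments would be the dicts returned by `simple_json_from_html_string` with `use_readability=True` and `use_readability=False`. A low ratio suggests inspecting both results before settling on one extractor.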