ReadabiliPy

0.3.0 · active · verified Sun Apr 12

ReadabiliPy is a Python library that wraps Mozilla's Readability.js, which extracts the main article content from HTML pages. It also includes pure-Python article extraction routines, and it augments the Readability.js output with plain-text representations of the article's paragraphs. The current version is 0.3.0, and the project is actively developed, with periodic releases.
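
To make the "plain text representations" concrete, here is a minimal sketch that joins them into one string. It does not call ReadabiliPy: `sample_article` is a hand-written, hypothetical stand-in for a result dictionary, and `article_text` is an illustrative helper, not part of the library. The assumed shape (a `plain_text` list whose entries are dictionaries with a `text` key) should be checked against the output of your installed version; bare strings are tolerated as well.

```python
def article_text(article, separator="\n\n"):
    """Join the plain-text blocks of an extraction result into one string."""
    # "plain_text" may be missing or None, so fall back to an empty list
    blocks = article.get("plain_text") or []
    texts = []
    for block in blocks:
        # Assumption: each block is a dict with a "text" key; accept bare strings too
        text = block.get("text", "") if isinstance(block, dict) else str(block)
        if text.strip():
            texts.append(text.strip())
    return separator.join(texts)


# Hypothetical result dictionary, written by hand for illustration only
sample_article = {
    "title": "Example Article",
    "plain_text": [
        {"text": "First paragraph."},
        {"text": "   "},  # whitespace-only blocks are skipped
        {"text": "Second paragraph."},
    ],
}

print(article_text(sample_article))
```

Skipping whitespace-only blocks keeps the joined text free of empty paragraphs, which is usually what downstream text processing expects.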

Warnings

Install

Imports

Quickstart

This quickstart uses `simple_json_from_html_string` from `readabilipy` to extract article content from an HTML string. It runs the extraction twice: once with `use_readability=True`, which calls Mozilla's Readability.js through Node.js when Node.js is available, and once with `use_readability=False`, which uses the pure-Python implementation. The two methods may produce different results.

import requests
from readabilipy import simple_json_from_html_string

# Fetch a page to use as input; on any request error, fall back to a small inline HTML document
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
try:
    req = requests.get(url, timeout=10)
    req.raise_for_status() # Raise an exception for HTTP errors
    html_content = req.text
except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")
    html_content = "<html><body><h1>Example Article</h1><p>This is a paragraph.</p></body></html>"

def snippet(article, length=100):
    """Return the start of the first plain-text block, or '' if there is none."""
    blocks = article.get("plain_text") or []  # may be None or empty if extraction fails
    if not blocks:
        return ""
    first = blocks[0]
    # Assumption: blocks are dicts with a "text" key; bare strings are handled too
    text = first.get("text", "") if isinstance(first, dict) else str(first)
    return text[:length]

# Extract article using Readability.js (requires Node.js installed)
# Set use_readability=True to enable the Node.js wrapper
# If Node.js is not found, it will fall back to the Python-only parser
article_js = simple_json_from_html_string(html_content, use_readability=True)
print("--- Extracted with Readability.js (or Python fallback) ---")
print(f"Title: {article_js.get('title')}")
print(f"Content snippet: {snippet(article_js)}...")

# Extract article using the pure Python implementation
article_py = simple_json_from_html_string(html_content, use_readability=False)
print("\n--- Extracted with Pure Python ---")
print(f"Title: {article_py.get('title')}")
print(f"Content snippet: {snippet(article_py)}...")
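
The quickstart relies on the library falling back to the Python parser when Node.js is missing. If you would rather make that choice explicit, the following sketch checks for a Node.js executable with the standard library before calling the extractor. `node_available` is a hypothetical helper name, and the check assumes the executable is on `PATH` under the name `node`; finding it does not guarantee that the Readability.js dependencies are installed.

```python
import shutil


def node_available(node_binary="node"):
    """Return True when an executable with the given name is found on PATH."""
    # shutil.which returns the full path to the executable, or None if absent
    return shutil.which(node_binary) is not None


# Decide up front which extraction path to request
use_readability = node_available()
print(f"use_readability={use_readability}")
```

The resulting boolean can be passed as the `use_readability` argument of `simple_json_from_html_string`, so the output label you print matches the parser that was actually requested.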
