{"id":4390,"library":"readabilipy","title":"ReadabiliPy","description":"ReadabiliPy is a Python library that provides a wrapper for Mozilla's Readability.js, a powerful tool for extracting the main content from HTML pages. It also includes pure Python article extraction routines. The library augments the Readability.js output to include plain text representations of article paragraphs. The current version is 0.3.0, and it has an active development status, with updates released periodically.","status":"active","version":"0.3.0","language":"en","source_language":"en","source_url":"https://github.com/alan-turing-institute/ReadabiliPy","tags":["readability","html parsing","content extraction","nodejs wrapper","web scraping"],"install":[{"cmd":"pip install readabilipy","lang":"bash","label":"Install ReadabiliPy"}],"dependencies":[{"reason":"Required for using Mozilla's Readability.js wrapper functionality (version 14 or higher). Not needed for the pure Python extractor.","package":"Node.js","optional":true},{"reason":"Runtime dependency for HTML parsing.","package":"beautifulsoup4","optional":false},{"reason":"Runtime dependency for HTML parsing.","package":"html5lib","optional":false},{"reason":"Runtime dependency for HTML parsing.","package":"lxml","optional":false},{"reason":"Runtime dependency for text processing.","package":"regex","optional":false}],"imports":[{"note":"This is the primary function for extracting article content from an HTML string.","symbol":"simple_json_from_html_string","correct":"from readabilipy import simple_json_from_html_string"}],"quickstart":{"code":"import requests\nfrom readabilipy import simple_json_from_html_string\n\n# Example HTML content (or fetch from a URL)\nurl = \"https://en.wikipedia.org/wiki/Python_(programming_language)\"\ntry:\n    req = requests.get(url, timeout=10)\n    req.raise_for_status() # Raise an exception for HTTP errors\n    html_content = req.text\nexcept requests.exceptions.RequestException as e:\n    
print(f\"Error fetching URL: {e}\")\n    html_content = \"<html><body><h1>Example Article</h1><p>This is a paragraph.</p></body></html>\"\n\n\ndef first_snippet(article, length=100):\n    # plain_text is a list of {'text': ...} dicts; it may be empty or None\n    blocks = article.get('plain_text') or []\n    return blocks[0]['text'][:length] if blocks else ''\n\n\n# Extract article using Readability.js (requires Node.js installed)\n# Set use_readability=True to enable the Node.js wrapper\n# If Node.js is not found, it will fall back to the Python-only parser\narticle_js = simple_json_from_html_string(html_content, use_readability=True)\nprint(\"--- Extracted with Readability.js (or Python fallback) ---\")\nprint(f\"Title: {article_js.get('title')}\")\nprint(f\"Content snippet: {first_snippet(article_js)}...\")\n\n# Extract article using the pure Python implementation\narticle_py = simple_json_from_html_string(html_content, use_readability=False)\nprint(\"\\n--- Extracted with Pure Python ---\")\nprint(f\"Title: {article_py.get('title')}\")\nprint(f\"Content snippet: {first_snippet(article_py)}...\")","lang":"python","description":"This quickstart shows how to use `readabilipy` to extract article content from an HTML string. It covers both `use_readability=True` (which uses Mozilla's Readability.js via Node.js if available) and `use_readability=False` (the pure Python implementation). The two methods may produce different results."},"warnings":[{"fix":"Install Node.js (v14+) from nodejs.org and ensure the `node` executable is on your system's PATH. If Node.js is not an option, set `use_readability=False` to explicitly use the Python-only extractor.","message":"To use Mozilla's Readability.js functionality, you must have Node.js (version 14 or higher) installed and accessible in your system's PATH. 
Without Node.js, `readabilipy` does not raise an error when `use_readability=True` is specified: it prints a warning to stderr and falls back to its pure Python extraction routines.","severity":"gotcha","affected_versions":"<=0.3.0"},{"fix":"Always set `use_readability=True` or `use_readability=False` explicitly, and ensure Node.js is correctly installed if you intend to use the Readability.js wrapper.","message":"The `use_readability` flag (which defaults to `False` in `simple_json_from_html_string`) controls whether the Node.js-based Readability.js wrapper or the pure Python extractor is used, so omitting it selects the pure Python extractor. The results from these two methods can differ significantly for certain articles.","severity":"gotcha","affected_versions":"<=0.3.0"},{"fix":"Upgrade to `readabilipy` v0.3.0 or newer. Ensure your input HTML is correctly encoded, preferably UTF-8.","message":"Prior to v0.3.0, processing certain HTML content could raise `UnicodeEncodeError` or `UnicodeDecodeError` because of encoding issues in the external Node.js subprocess calls and file handling.","severity":"breaking","affected_versions":"<0.3.0"},{"fix":"Upgrade to `readabilipy` v0.3.0 or newer, which includes a fix for this working directory bug. If upgrading is not possible, ensure your application does not change the current working directory while `readabilipy` is processing HTML.","message":"Versions prior to v0.3.0 had a bug related to changes in the working directory during article extraction, potentially leading to incorrect file paths or failures when using the Readability.js wrapper.","severity":"gotcha","affected_versions":"<0.3.0"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}