{"id":3790,"library":"readability-lxml","title":"readability-lxml","description":"readability-lxml is a Python library that provides a fast HTML to text parser, designed to extract and clean up the main body text and title from an HTML document. It is a Python port of a Ruby port of arc90's Readability project. The library is actively maintained: the latest version is 0.8.4.1 (last PyPI upload in May 2025), and new releases typically add Python version support, fix bugs, or add minor features.","status":"active","version":"0.8.4.1","language":"en","source_language":"en","source_url":"https://github.com/buriy/python-readability","tags":["HTML parsing","readability","text extraction","web scraping","lxml"],"install":[{"cmd":"pip install readability-lxml","lang":"bash","label":"Via pip"}],"dependencies":[{"reason":"Used for character encoding detection.","package":"chardet"},{"reason":"Used for CSS selector support in parsing.","package":"cssselect"},{"reason":"Core dependency for HTML parsing and DOM manipulation.","package":"lxml"},{"reason":"Used for cleaning HTML.","package":"lxml-html-clean"}],"imports":[{"note":"The more direct `from readability import Document` is the current and recommended import path; older examples might use `from readability.readability import Document`.","wrong":"from readability.readability import Document","symbol":"Document","correct":"from readability import Document"}],"quickstart":{"code":"import requests\nfrom readability import Document\nimport os  # Only used to read the example URL from the environment\nfrom lxml.html import fromstring  # For plain text conversion\n\n# Replace with a real URL or local HTML content\nurl = os.environ.get('READABILITY_TEST_URL', 'http://example.com')\n\ntry:\n    response = requests.get(url, timeout=10)\n    response.raise_for_status()  # Raise an exception for HTTP errors\n    html_content = response.content\nexcept requests.exceptions.RequestException as 
e:\n    print(f\"Error fetching URL: {e}\")\n    html_content = b\"<html><body><h1>Default Title</h1><p>This is some example content.</p></body></html>\"\n\ndoc = Document(html_content)\ntitle = doc.title()\nsummary_html = doc.summary()\n\nprint(f\"Title: {title}\")\nprint(\"Summary HTML (first 500 chars):\")\nprint(summary_html[:500])\n\n# Optional: Get a plain text version (strip tags) using lxml.html\nclean_doc = fromstring(summary_html)\nprint(\"\\nSummary Text (first 200 chars):\")\nprint(clean_doc.text_content()[:200])","lang":"python","description":"This quickstart fetches HTML content from a URL with `requests` (falling back to a small inline HTML document if the request fails), then uses `readability-lxml` to extract the article's title and a cleaned HTML summary. It also shows how to get a plain text version of the summary HTML using `lxml.html`."},"warnings":[{"fix":"Update downstream parsers or consumers of `summary()` output to handle HTML5, or implement a conversion step if strict XHTML is required.","message":"Version 0.8 replaced XHTML output with HTML5 output in the `summary()` call. If your application was expecting strict XHTML, this change could break parsing or rendering logic.","severity":"breaking","affected_versions":"0.8 and later"},{"fix":"To use both libraries, install them in separate virtual environments. Loading both in the same interpreter would require custom `importlib` handling to alias one of the packages.","message":"There is a potential import name collision with the `py-readability-metrics` library, as both install a top-level `readability` package, which is the package this library's `Document` class is imported from. 
Using both in the same environment can lead to one overriding the other.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure `libxml2-dev` and `libxslt-dev` (or equivalent development packages for your OS) are installed before attempting to install `lxml` or `readability-lxml` from source.","message":"The library relies on `lxml` which in turn requires `libxml2` and `libxslt` C libraries. While `pip install` often handles binary wheels, source builds on some platforms (like macOS or Linux distributions without pre-packaged dev libraries) might require manual installation of these system dependencies.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Migrate to Python 3.x if still on Python 2.x. Use `readability-lxml` 0.6 or older if strict Python 2.x compatibility is required (not recommended due to security and lack of maintenance).","message":"While older versions (up to 0.6) explicitly supported Python 2.6, 2.7, 3.3, 3.4, the project summary now states 'python 3 support' and recent updates focus on Python 3.7+ (up to 3.13). Python 2.x support is effectively deprecated and likely broken in current versions.","severity":"deprecated","affected_versions":"Versions after 0.6; fully deprecated in 0.7+"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}