readability-lxml
readability-lxml is a Python library that provides a fast HTML-to-text parser, designed to extract and clean up the main body text and title from an HTML document. It is a Python port of a Ruby port of arc90's Readability project. The library is actively maintained: the latest version is 0.8.4.1 as of May 2025 (last PyPI upload date). New releases typically add Python version support, fix bugs, or add minor features.
Warnings
- breaking: Version 0.8 replaced XHTML output with HTML5 output in the `summary()` call. If your application expects strict XHTML, this change can break parsing or rendering logic.
- gotcha: There is a potential import name collision with the `py-readability-metrics` library, because both install a top-level `readability` package. With both in the same environment, one can override the other, so `from readability import Document` may resolve to the wrong library.
- gotcha: The library relies on `lxml`, which in turn requires the `libxml2` and `libxslt` C libraries. `pip install` usually pulls in binary wheels, but source builds on some platforms (such as macOS or Linux distributions without pre-packaged dev libraries) may require installing these system dependencies manually.
- deprecated: Older versions (up to 0.6) explicitly supported Python 2.6, 2.7, 3.3, and 3.4. The project summary now states 'python 3 support', and recent updates focus on Python 3.7+ (up to 3.13). Python 2.x support is effectively deprecated and likely broken in current versions.
Install
pip install readability-lxml
Imports
- Document
from readability import Document
Quickstart
import requests
from readability import Document
import os # For example usage, though not strictly required by readability-lxml itself
from lxml.html import fromstring # For plain text conversion
# Replace with a real URL or local HTML content
url = os.environ.get('READABILITY_TEST_URL', 'http://example.com')
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # Raise an exception for HTTP errors
    html_content = response.content
except requests.exceptions.RequestException as e:
    # Fall back to a small inline document so the rest of the example still runs
    print(f"Error fetching URL: {e}")
    html_content = b"<html><body><h1>Default Title</h1><p>This is some example content.</p></body></html>"
doc = Document(html_content)
title = doc.title()
summary_html = doc.summary()
print(f"Title: {title}")
print("Summary HTML (first 500 chars):")
print(summary_html[:500])
# Optional: Get a plain text version (strip tags) using lxml.html
clean_doc = fromstring(summary_html)
print("\nSummary Text (first 200 chars):")
print(clean_doc.text_content()[:200])