readability-lxml

0.8.4.1 · active · verified Sat Apr 11

readability-lxml is a Python library that provides a fast HTML-to-text parser. It extracts and cleans up the main body text and title of an HTML document. It is a Python port of a Ruby port of arc90's Readability project. The library is actively maintained: the latest version, 0.8.4.1, was uploaded to PyPI in May 2025, and new releases typically add Python version support, fix bugs, or add minor features.

Warnings

Install

Imports

Quickstart

This quickstart fetches HTML from a URL with `requests`, falling back to a small inline document if the request fails. It then uses `readability-lxml` to extract the article's title and a cleaned HTML summary of the main content. Finally, it shows how to get a plain-text version of that summary with `lxml.html`.

import os  # Only used to read the optional READABILITY_TEST_URL environment variable

import requests
from lxml.html import fromstring  # For plain text conversion
from readability import Document

# Set READABILITY_TEST_URL to try a different page; defaults to example.com
url = os.environ.get('READABILITY_TEST_URL', 'http://example.com')

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status() # Raise an exception for HTTP errors
    html_content = response.content
except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")
    html_content = b"<html><body><h1>Default Title</h1><p>This is some example content.</p></body></html>"

doc = Document(html_content)
title = doc.title()
summary_html = doc.summary()

print(f"Title: {title}")
print("Summary HTML (first 500 chars):")
print(summary_html[:500])

# Optional: Get a plain text version (strip tags) using lxml.html
clean_doc = fromstring(summary_html)
print("\nSummary Text (first 200 chars):")
print(clean_doc.text_content()[:200])
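
`text_content()` returns the text nodes concatenated as they appear in the markup, so the plain-text output may still contain indentation and blank lines. The following is a minimal sketch of one possible cleanup step, using only the standard library. The names `tidy_text` and `sample` are hypothetical and chosen for illustration; they are not part of readability-lxml or lxml.

```python
import re


def tidy_text(text: str) -> str:
    """Collapse whitespace runs within each line and drop blank lines."""
    lines = []
    for raw_line in text.splitlines():
        # Collapse tabs and repeated spaces inside the line, then trim the ends.
        line = re.sub(r"\s+", " ", raw_line).strip()
        if line:  # Skip lines that were empty or whitespace-only
            lines.append(line)
    return "\n".join(lines)


# Hypothetical sample resembling untidy text extracted from HTML
sample = "\n\n   Default Title \n\n\t This is   some example content.  \n\n"
print(tidy_text(sample))
```

In the quickstart, this could be applied as `tidy_text(clean_doc.text_content())` before slicing the first 200 characters.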
