lxml-html-clean

0.4.4 · active · verified Thu Apr 09

lxml-html-clean is a Python library that provides the HTML cleaner originally shipped as `lxml.html.clean` in the `lxml` project. It removes unwanted tags, attributes, and scripts from HTML content to reduce exposure to XSS and similar injection attacks. The current version is 0.4.4. Releases are infrequent and mostly contain bug fixes or minor improvements.

Warnings

Install

Imports

Quickstart

This quickstart instantiates `Cleaner` with explicit options to sanitize an HTML document: scripts and iframes are removed, only an allowlist of tags is kept, and unsafe attributes are stripped. It covers the configuration options used most often for HTML sanitization.

from lxml_html_clean import Cleaner

html_content = """
<html>
    <head><title>Test</title></head>
    <body>
        <script>alert('xss');</script>
        <p style="color:red;">Hello <b>World</b>!</p>
        <a href="javascript:alert('bad');">Click me</a>
        <img src="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==">
        <iframe></iframe>
    </body>
</html>
"""

# Configure the cleaner with a tag allowlist plus explicit removal rules.
# Note: allow_tags acts as an allowlist, so tags that are not listed
# (here html, head, body, a, ...) are dropped while their text is kept.
cleaner = Cleaner(
    allow_tags=['p', 'b', 'img'],
    remove_tags=['script', 'iframe'],  # Drop these tags but keep their children and text
    kill_tags=['style'],               # Drop these tags together with their content
    safe_attrs_only=True, # Keep only attributes on the safe-attribute list
    forms=False,          # Leave the form-removal rule off (True would remove form tags)
    scripts=True,         # Remove <script> tags and their content
    comments=True,        # Remove HTML comments
    style=True,           # Remove <style> tags
    links=True,           # Remove <link> tags (e.g., <link rel='stylesheet'>)
    page_structure=False  # Leave the page-structure rule off (True would strip html, head, title)
)

cleaned_html = cleaner.clean_html(html_content)

print("--- Original HTML ---")
print(html_content)
print("\n--- Cleaned HTML ---")
print(cleaned_html)
