lxml-html-clean
lxml-html-clean is a Python library that provides an HTML cleaning utility, originally part of the `lxml` project. It removes unwanted tags, attributes, and scripts from HTML content to sanitize it and reduce exposure to XSS and similar injection attacks. The current version is 0.4.4. Releases are infrequent and typically contain bug fixes or minor improvements.
Warnings
- deprecated The `lxml.html.clean` module, including its `Cleaner` class, used to provide HTML cleaning directly within the `lxml` library. It has been split out of `lxml` (as of lxml 5.2 the code no longer ships with `lxml` itself), so importing from `lxml.html.clean` should be treated as deprecated. Migrate to this standalone `lxml-html-clean` package, which is where maintenance and updates now happen.
- gotcha The return type of `clean_html` follows the input type. Pass a string and you get a string back; pass an `lxml.etree._Element` or `lxml.html.HtmlElement` and you get an `lxml.html.HtmlElement` back. This is a footgun for type-sensitive code or for any caller that expects one consistent output type.
- gotcha `Cleaner` applies aggressive default cleaning settings (e.g., removing scripts, comments, `<link>` and `<meta>` tags, forms, embedded content, and unknown tags). Unless it is configured explicitly, it may strip more content than desired. In practice you often need to set `allow_tags`, `remove_tags`, `kill_tags`, `safe_attrs_only`, and the boolean flags (`scripts`, `style`, `links`, `forms`, `page_structure`, and so on) to match the markup you want to keep.
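The two gotchas above can be sketched in code. This is a minimal sketch, not part of the library: `clean_to_str` and `demo` are hypothetical names, the class is assumed to be importable as `Cleaner`, and bytes results are assumed to be UTF-8 encoded. `demo` is defined but not called, because it needs `lxml` and `lxml-html-clean` installed.

```python
from typing import Any, Callable


def clean_to_str(clean: Callable[[Any], Any], html: Any) -> str:
    """Run `clean(html)` and return the result as str, whatever type came back."""
    result = clean(html)
    if isinstance(result, str):
        return result
    if isinstance(result, bytes):
        # Assumption: bytes results are UTF-8 encoded.
        return result.decode("utf-8")
    # Anything else is treated as an lxml element and serialized.
    from lxml.html import tostring  # imported lazily; only needed on this path
    return tostring(result, encoding="unicode")


def demo() -> None:
    """Requires lxml and lxml-html-clean to be installed."""
    from lxml.html import fromstring
    from lxml_html_clean import Cleaner

    markup = (
        "<div><!-- note --><form><input name='q'></form>"
        "<p>Hello</p><script>alert(1)</script></div>"
    )

    strict = Cleaner()                                # library defaults
    lenient = Cleaner(comments=False, forms=False)    # keep comments and forms

    # The return type follows the input type.
    print(type(strict.clean_html(markup)).__name__)
    print(type(strict.clean_html(fromstring(markup))).__name__)

    # Through the helper, both input types come back as str.
    print(clean_to_str(strict.clean_html, markup))
    print(clean_to_str(strict.clean_html, fromstring(markup)))

    # Compare how much the defaults strip against a relaxed configuration.
    print(clean_to_str(lenient.clean_html, markup))
```

Calling `demo()` in an environment with both packages installed prints the result types and the cleaned markup for each configuration. Passing the cleaning function in as an argument keeps the helper independent of how the cleaner was configured.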
Install
pip install lxml-html-clean
Imports
- Cleaner
from lxml_html_clean import Cleaner
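For code that must also run on older `lxml` releases that still bundled the cleaner, a small import shim is one option. The sketch below is illustrative only: `load_cleaner_class` and `CANDIDATE_MODULES` are hypothetical names, and it assumes both modules expose the class as `Cleaner`.

```python
import importlib

# Module names to try, standalone package first.
CANDIDATE_MODULES = ("lxml_html_clean", "lxml.html.clean")


def load_cleaner_class(candidates=CANDIDATE_MODULES, attr="Cleaner"):
    """Return `attr` from the first candidate module that provides it."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            continue  # not importable in this environment; try the next one
        cleaner_class = getattr(module, attr, None)
        if cleaner_class is not None:
            return cleaner_class
    raise ImportError(f"none of {candidates!r} provides {attr!r}")
```

Typical use would be `Cleaner = load_cleaner_class()` once at import time, then `Cleaner(...)` as usual. If neither module is importable, the function raises `ImportError` rather than failing later at the first call.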
Quickstart
from lxml_html_clean import Cleaner
html_content = """
<html>
<head><title>Test</title></head>
<body>
<script>alert('xss');</script>
<p style="color:red;">Hello <b>World</b>!</p>
<a href="javascript:alert('bad');">Click me</a>
<img src="data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==">
<iframe></iframe>
</body>
</html>
"""
# Configure the cleaner to keep only a few tags and to remove scripts and styles
cleaner = Cleaner(
    allow_tags=['p', 'b', 'img'],      # Tags to keep; other tags are dropped (their text is kept)
    remove_tags=['script', 'iframe'],  # Drop these tags but keep their children and text
    kill_tags=['style'],               # Remove these tags together with their content
    safe_attrs_only=True,              # Keep only attributes on the safe list
    forms=True,                        # Remove form tags
    scripts=True,                      # Remove script tags
    comments=True,                     # Remove HTML comments
    style=True,                        # Remove style tags
    links=True,                        # Remove link tags (e.g., <link rel='stylesheet'>)
    page_structure=False,              # Do not strip html, head, title via this flag (allow_tags still applies)
)
cleaned_html = cleaner.clean_html(html_content)
print("--- Original HTML ---")
print(html_content)
print("\n--- Cleaned HTML ---")
print(cleaned_html)
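One way to sanity-check a configuration like the one above is to scan the cleaned output for tags outside the intended allow list. The sketch below uses only the standard library's `html.parser`; `disallowed_tags` is a hypothetical helper, not part of lxml-html-clean, and it only inspects tag names (not attributes or URLs).

```python
from html.parser import HTMLParser
from typing import Iterable, Set


class _TagCollector(HTMLParser):
    """Records the (lowercased) name of every start tag it encounters."""

    def __init__(self) -> None:
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        self.tags.add(tag)


def disallowed_tags(html: str, allowed: Iterable[str]) -> Set[str]:
    """Return the tag names present in `html` that are not in `allowed`."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags - set(allowed)
```

For instance, `disallowed_tags(cleaned_html, {'p', 'b', 'img'})` returns an empty set only if no other tag survived cleaning. Keep in mind that the cleaner may wrap its output in a container element, in which case that container tag needs to be in the allowed set as well.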