{"id":1541,"library":"lxml-html-clean","title":"lxml-html-clean","description":"lxml-html-clean is a Python library that provides the HTML cleaning utility originally shipped inside the `lxml` project as `lxml.html.clean`. It removes unwanted tags, attributes, and scripts from HTML content to reduce the risk of XSS and similar injection vulnerabilities. The current version is 0.4.4. Releases are infrequent and typically contain bug fixes or minor improvements.","status":"active","version":"0.4.4","language":"en","source_language":"en","source_url":"https://github.com/fedora-python/lxml_html_clean/","tags":["html","sanitizer","cleaner","lxml","security","xss","html-cleaning"],"install":[{"cmd":"pip install lxml-html-clean","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core dependency for HTML parsing and manipulation.","package":"lxml","optional":false}],"imports":[{"note":"The cleaner now lives in this standalone package and the class is still named `Cleaner`. Import it from `lxml_html_clean` rather than from `lxml.html.clean`.","wrong":"from lxml.html.clean import Cleaner","symbol":"Cleaner","correct":"from lxml_html_clean import Cleaner"}],"quickstart":{"code":"from lxml_html_clean import Cleaner\n\nhtml_content = \"\"\"\n<html>\n    <head><title>Test</title></head>\n    <body>\n        <script>alert('xss');</script>\n        <p style=\"color:red;\">Hello <b>World</b>!</p>\n        <a href=\"javascript:alert('bad');\">Click me</a>\n        <img src=\"data:image/gif;base64,R0lGODlhAQABAIAAAP///wAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw==\">\n        <iframe></iframe>\n    </body>\n</html>\n\"\"\"\n\n# Configure the cleaner to allow specific tags but remove scripts and styles\ncleaner = Cleaner(\n    allow_tags=['p', 'b', 'img'],\n    remove_unknown_tags=False,  # Required with allow_tags; the default (True) conflicts and raises ValueError\n    remove_tags=['script', 'iframe'],\n    kill_tags=['style'],\n    safe_attrs_only=True, # Remove potentially unsafe attributes\n    forms=False,          # Do not strip form tags via this flag\n    scripts=True,         # Remove script tags\n    comments=True,        # Remove 
HTML comments\n    style=True,           # Remove style tags\n    links=True,           # Remove link tags (e.g., <link rel='stylesheet'>)\n    page_structure=False  # Do not strip html, head, title via this flag\n)\n\ncleaned_html = cleaner.clean_html(html_content)\n\nprint(\"--- Original HTML ---\")\nprint(html_content)\nprint(\"\\n--- Cleaned HTML ---\")\nprint(cleaned_html)","lang":"python","description":"This quickstart shows how to instantiate the cleaner with an explicit configuration to sanitize HTML content: scripts and iframes are removed, only the allowed tags are preserved, and unsafe attributes are stripped."},"warnings":[{"fix":"Replace `from lxml.html.clean import Cleaner` with `from lxml_html_clean import Cleaner`. Ensure `lxml-html-clean` is installed via `pip install lxml-html-clean`.","message":"The HTML cleaning code that used to ship inside `lxml` as `lxml.html.clean` was moved out of `lxml` into this standalone package as of lxml 5.2.0. On those versions, importing `lxml.html.clean` fails unless `lxml-html-clean` is installed (directly or via the `lxml[html_clean]` extra). Import from `lxml_html_clean` to receive ongoing maintenance and security fixes.","severity":"deprecated","affected_versions":"lxml >= 5.2.0 (cleaner no longer bundled with lxml); lxml-html-clean all versions"},{"fix":"Be explicit about parsing and serializing if you need a specific type. If you need an `lxml` element back, parse the HTML with `lxml.html.fromstring` first, then clean the element, and serialize it back to a string with `lxml.html.tostring` if needed. Example: `tree = lxml.html.fromstring(html_str); cleaned_tree = cleaner.clean_html(tree); cleaned_str = lxml.html.tostring(cleaned_tree).decode()`.","message":"The return type of `clean_html` depends on the input type. If you pass a string, it returns a string. If you pass an `lxml.etree._Element` or `lxml.html.HtmlElement`, it returns a cleaned element rather than a string. 
This can surprise type-sensitive code that expects a consistent output type.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always review and configure the `Cleaner` parameters in its constructor (`Cleaner(...)`) to match your exact content requirements and security policies. Decide up front what you want to permit and what you want to remove. When passing `allow_tags`, also pass `remove_unknown_tags=False`, because the two options conflict.","message":"`Cleaner` applies aggressive default cleaning settings (e.g., removing scripts, comments, links, forms, embedded content, and unknown tags). Without explicit configuration it may strip more content than desired. Users often need to define `allow_tags`, `remove_tags`, `kill_tags`, `safe_attrs_only`, and the boolean flags precisely.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}