HTML Sanitizer
An allowlist-based, deliberately opinionated HTML sanitizer for Python that cleans HTML fragments from any source, trusted or not. It is built on `lxml` so that its output is valid, safe HTML. Beyond allowlisting tags and attributes, it applies additional transforms that normalize and simplify the markup, which helps make content from rich text editors consistent. The project is actively maintained.
Warnings
- gotcha The library is strictly allowlist-based and opinionated. With the default configuration, many common elements (such as `div` and `img`), all inline styles, and all scripts are removed, whether or not they are malicious. To keep additional tags or attributes, pass a settings dictionary to the `Sanitizer` constructor.
- breaking CVE-2024-34078: with `keep_typographic_whitespace=False` (the default), the sanitizer normalizes Unicode, and certain characters that normalize to markup characters could bypass sanitization, which could lead to XSS attacks. The issue is fixed in version 2.4.2, so upgrade to at least that release.
- gotcha The `Sanitizer` constructor performs consistency checks on provided settings. If there are conflicts (e.g., a tag is marked as `empty` but not in the `tags` allowlist), a `TypeError` will be raised.
- gotcha HTML comments are stripped by default. If preserving comments is necessary, this behavior needs to be explicitly overridden, though generally, comments in user-generated content are not considered safe.
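The Unicode issue in the CVE warning above can be illustrated with the standard library alone. The sketch below assumes the normalization involved is NFKC; the helper `normalizes_to_markup` and the `LOOKALIKES` table are hypothetical names used only for illustration and are not part of `html_sanitizer`.

```python
import unicodedata

# Characters that are not ASCII chevrons but turn into them under NFKC.
LOOKALIKES = {
    "\uff1c": "FULLWIDTH LESS-THAN SIGN",
    "\uff1e": "FULLWIDTH GREATER-THAN SIGN",
    "\ufe64": "SMALL LESS-THAN SIGN",
    "\ufe65": "SMALL GREATER-THAN SIGN",
}


def normalizes_to_markup(text: str) -> bool:
    """Return True if NFKC normalization adds '<' or '>' that text did not contain."""
    normalized = unicodedata.normalize("NFKC", text)
    return any(normalized.count(ch) > text.count(ch) for ch in "<>")


# Contains no ASCII '<' or '>', so a tag-based filter sees only plain text.
payload = "\uff1cscript\uff1ealert(1)\uff1c/script\uff1e"

# After normalization the same string is ordinary markup.
print(unicodedata.normalize("NFKC", payload))
print(normalizes_to_markup(payload))
```

A check like this is only a way to see why normalizing after sanitizing is risky; it is not a substitute for upgrading the library.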
Install
pip install html-sanitizer
Imports
- Sanitizer
from html_sanitizer import Sanitizer
Quickstart
from html_sanitizer import Sanitizer
sanitizer = Sanitizer()  # default configuration
# The <script> element and the inline style are outside the default allowlist
dirty_html = '<p>Hello <script>alert("XSS")</script>World!</p><span style="font-weight:bold">some text</span>'
safe_html = sanitizer.sanitize(dirty_html)
print(safe_html)
# Example with custom configuration
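custom_sanitizer = Sanitizer({
    'tags': {'p', 'h1', 'a'},
    'attributes': {'a': ('href', 'title')},
    # 'empty' and 'separate' may only name tags listed in 'tags', so clear them here
    'empty': set(),
    'separate': set(),
})
# <img> is not in the custom allowlist, so it is removed from the output
custom_dirty_html = '<h1>Title</h1><p>Some text. <a href="/link">Link</a> <img src="x.jpg"> </p>'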
custom_safe_html = custom_sanitizer.sanitize(custom_dirty_html)
print(custom_safe_html)
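The custom configuration above sets `empty` and `separate` to empty sets, presumably because the constructor's consistency checks reject any setting that names a tag outside `tags`. The rule can be sketched without the library as follows. `check_settings` is a hypothetical stand-in, not the library's implementation, and the `hr`/`br` values in the failing call are assumed for illustration.

```python
def check_settings(tags, empty, separate, attributes):
    """Hypothetical stand-in for the consistency rule: every tag named in
    'empty', 'separate', or 'attributes' must also appear in 'tags'."""
    allowed = set(tags)
    groups = (
        ("empty", set(empty)),
        ("separate", set(separate)),
        ("attributes", set(attributes)),  # keys of the attributes mapping
    )
    for name, group in groups:
        extra = group - allowed
        if extra:
            raise TypeError(f'Tags in "{name}", but not allowed: {sorted(extra)}')


# Consistent: nothing refers to a tag outside the allowlist.
check_settings(
    tags={"p", "h1", "a"},
    empty=set(),
    separate=set(),
    attributes={"a": ("href", "title")},
)

# Inconsistent: 'hr' and 'br' are marked as empty but are not allowed tags.
try:
    check_settings(
        tags={"p", "h1", "a"},
        empty={"hr", "a", "br"},
        separate=set(),
        attributes={"a": ("href",)},
    )
except TypeError as exc:
    message = str(exc)
    print(message)
```

When narrowing `tags`, it is therefore worth reviewing every other setting that refers to tag names in the same dictionary.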