HTML Sanitizer

2.6.0 · active · verified Mon Apr 13

An allowlist-based, deliberately opinionated HTML sanitizer for Python that cleans HTML fragments from both untrusted and trusted sources. It is built on `lxml`, which it uses to produce valid and safe HTML output. Beyond tag and attribute allowlisting, it applies additional transforms that normalize and simplify the markup, which makes the output of rich text editors in particular more consistent. The project is actively maintained.

Quickstart

Create a `Sanitizer` instance, optionally passing a settings dictionary, and call its `sanitize` method with the HTML string to clean. The default configuration is restrictive and allows only a fixed set of tags and attributes.

from html_sanitizer import Sanitizer

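sanitizer = Sanitizer()  # uses the default configuration

# <script> and <span> are not on the default allowlist, so neither should
# survive as-is in the sanitized output.
dirty_html = '<p>Hello <script>alert("XSS")</script>World!</p><span style="font-weight:bold">some text</span>'
safe_html = sanitizer.sanitize(dirty_html)
print(safe_html)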

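# Example with custom configuration.
# Tags listed in 'empty', 'separate' and 'whitespace' must also appear in
# 'tags', so these sets are cleared here to match the narrower allowlist.
custom_sanitizer = Sanitizer({
    'tags': {'p', 'h1', 'a'},
    'attributes': {'a': ('href', 'title')},
    'empty': set(),
    'separate': set(),
    'whitespace': set(),
})
# <img> is not in 'tags', so it should be dropped from the result.
custom_dirty_html = '<h1>Title</h1><p>Some text. <a href="/link">Link</a> <img src="x.jpg"> </p>'
custom_safe_html = custom_sanitizer.sanitize(custom_dirty_html)
print(custom_safe_html)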
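
When the allowlist is narrowed, the auxiliary sets have to be narrowed with it. The following is a minimal sketch of one way to keep them consistent, assuming the sanitizer rejects settings in which `empty`, `separate` or `whitespace` name a tag that is missing from `tags`. `restrict_settings` is a hypothetical helper written for this illustration, not part of the library; it only builds a plain dictionary and does not import `html_sanitizer` itself.

```python
def restrict_settings(tags, attributes, empty=(), separate=(), whitespace=()):
    """Build a settings dict whose auxiliary sets only contain allowed tags.

    Hypothetical helper (not part of html_sanitizer): it intersects each
    auxiliary set with the tag allowlist and drops attribute rules for tags
    that are not allowed.
    """
    allowed = set(tags)
    return {
        "tags": allowed,
        # Keep attribute rules only for tags that are actually allowed.
        "attributes": {
            tag: tuple(names)
            for tag, names in attributes.items()
            if tag in allowed
        },
        # Each auxiliary set is reduced to a subset of the allowlist.
        "empty": set(empty) & allowed,
        "separate": set(separate) & allowed,
        "whitespace": set(whitespace) & allowed,
    }


settings = restrict_settings(
    tags={"p", "h1", "a"},
    attributes={"a": ("href", "title"), "img": ("src",)},
    empty={"a", "br"},
    separate={"p", "li"},
    whitespace={"br"},
)
print(settings)
```

The resulting dictionary has the same shape as the custom configuration above and could be passed to `Sanitizer(settings)`.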
