{"id":5249,"library":"html-sanitizer","title":"HTML Sanitizer","description":"This is an allowlist-based and very opinionated HTML sanitizer for Python, designed to clean up HTML fragments from untrusted or trusted sources. It's built upon `lxml` to ensure valid and safe HTML output. Beyond basic tag and attribute allowlisting, it applies additional transforms to normalize and simplify HTML content, aiming for consistency, especially from rich text editors. It's actively maintained.","status":"active","version":"2.6.0","language":"en","source_language":"en","source_url":"https://github.com/matthiask/html-sanitizer/","tags":["html","sanitizer","xss-protection","security","lxml"],"install":[{"cmd":"pip install html-sanitizer","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core parsing and cleaning functionality relies on lxml's HTML cleaner.","package":"lxml","optional":false}],"imports":[{"symbol":"Sanitizer","correct":"from html_sanitizer import Sanitizer"}],"quickstart":{"code":"from html_sanitizer import Sanitizer\n\nsanitizer = Sanitizer() # Uses default configuration\n\ndirty_html = '<p>Hello <script>alert(\"XSS\")</script>World!</p><span style=\"font-weight:bold\">some text</span>'\nsafe_html = sanitizer.sanitize(dirty_html)\nprint(safe_html)\n\n# Example with custom configuration\ncustom_sanitizer = Sanitizer({\n    'tags': {'p', 'h1', 'a'},\n    'attributes': {'a': ('href', 'title')},\n    'empty': set(),\n    'separate': set(),\n})\ncustom_dirty_html = '<h1>Title</h1><p>Some text. <a href=\"/link\">Link</a> <img src=\"x.jpg\"> </p>'\ncustom_safe_html = custom_sanitizer.sanitize(custom_dirty_html)\nprint(custom_safe_html)","lang":"python","description":"Initialize a `Sanitizer` object (with or without custom settings) and call its `sanitize` method with the dirty HTML string. 
The default configuration is restrictive and allows only a specific set of tags and attributes."},"warnings":[{"fix":"Always review the default `Sanitizer` settings. Customize `tags`, `attributes`, and other options when initializing `Sanitizer` to fit your specific use case. Refer to the documentation for available settings.","message":"The library is strictly allowlist-based and 'opinionated'. By default, many common HTML elements (like `div`, `img`) and all inline styles and scripts are removed, even if not explicitly malicious. Users must configure the `Sanitizer` instance to allow more tags/attributes.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Upgrade to version 2.4.2 or later to receive the fix for this vulnerability. Ensure your deployment pipelines automatically update or flag vulnerable versions.","message":"A security vulnerability (CVE-2024-34078) was identified where specific Unicode characters, when normalized, could bypass sanitization if `keep_typographic_whitespace=False` (which is the default behavior). This could lead to XSS attacks.","severity":"breaking","affected_versions":"<2.4.2"},{"fix":"Carefully define your `Sanitizer` settings to ensure logical consistency. For instance, any tag listed in `empty` or `separate` must also be present in the `tags` allowlist.","message":"The `Sanitizer` constructor performs consistency checks on the provided settings. If there are conflicts (e.g., a tag is marked as `empty` but is not in the `tags` allowlist), the constructor raises a `TypeError`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Stripping comments is generally recommended for security and cleanliness. If you must preserve them, check the `Sanitizer` documentation for an option to retain them (e.g., `strip_comments=False`, a setting name that is not confirmed here and may not exist in your version's configuration).","message":"HTML comments are stripped by default. 
If preserving comments is necessary, this behavior needs to be explicitly overridden, though generally, comments in user-generated content are not considered safe.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-13T00:00:00.000Z","next_check":"2026-07-12T00:00:00.000Z"}