{"id":770,"library":"html5lib","title":"HTML5 Parser for Python","description":"html5lib is a pure-Python library for parsing HTML documents, designed to conform to the WHATWG HTML specification, as implemented by major web browsers. Its current stable version is 1.1, released in June 2020, with development on version 1.2 ongoing but unreleased. The library's release cadence is irregular, with long gaps between stable releases.","status":"active","version":"1.1","language":"python","source_language":"en","source_url":"https://github.com/html5lib/html5lib-python","tags":["html","parser","html5","web scraping","DOM"],"install":[{"cmd":"pip install html5lib","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required dependency providing Python 2.x/3.x compatibility helpers; html5lib requires six>=1.9.","package":"six"},{"reason":"Required dependency for decoding input byte streams, ensuring compliance with the Encoding Standard (since 0.99999999/1.0b9).","package":"webencodings"},{"reason":"Optional, for improved performance and lxml.etree tree format support (not on PyPy due to segfaults).","package":"lxml","optional":true},{"reason":"Optional, provides a treewalker (but not a builder).","package":"genshi","optional":true},{"reason":"Optional, for heuristic character encoding detection as a fallback.","package":"chardet","optional":true}],"imports":[{"symbol":"html5lib","correct":"import html5lib"},{"note":"`import html5lib.HTMLParser` fails because `HTMLParser` is a class, not a submodule. Import it from 'html5lib.html5parser', or access the top-level re-export as `html5lib.HTMLParser` after `import html5lib`.","wrong":"import html5lib.HTMLParser","symbol":"HTMLParser","correct":"from html5lib.html5parser import HTMLParser"},{"note":"The `getTreeBuilder` function is re-exported at the top-level `html5lib` module for convenience.","wrong":"from html5lib.treebuilders import getTreeBuilder","symbol":"getTreeBuilder","correct":"from html5lib import 
getTreeBuilder"}],"quickstart":{"code":"import html5lib\n\n# Parse a simple HTML string\ndocument = html5lib.parse(\"<p>Hello <b>World</b>!</p>\")\nprint(f\"Parsed document tag: {document.tag}\")\nprint(f\"First child's tag: {document[0].tag}\")\nprint(f\"First child's text: {document[0].text}\")\n\n# Parse with a specific treebuilder (e.g., xml.dom.minidom)\nparser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder(\"dom\"))\nminidom_document = parser.parse(\"<p>Another <i>example</i>.</p>\")\n# Accessing an element using minidom structure\np_tag = minidom_document.getElementsByTagName('p')[0]\nprint(f\"Minidom document P tag: {p_tag.tagName}\")\n\n# Example with strict parsing (raises exceptions on errors)\ntry:\n    strict_parser = html5lib.HTMLParser(strict=True)\n    strict_parser.parse(\"<div><p>Missing close tag\")\nexcept html5lib.html5parser.ParseError as e:\n    print(f\"Strict parsing error: {e}\")","lang":"python","description":"This quickstart demonstrates basic HTML parsing using `html5lib.parse` for the default `xml.etree` output, and how to use `html5lib.HTMLParser` with a custom treebuilder like `xml.dom.minidom`. It also shows how to enable strict parsing to catch HTML errors. By default, `html5lib` provides an `xml.etree` element instance, but `xml.dom.minidom` and `lxml.etree` are also supported via treebuilders."},"warnings":[{"fix":"Upgrade to Python 3.5+ for html5lib 1.1+.","message":"Support for Python 2.6, 3.3, and 3.4 has been dropped in recent versions. Specifically, Python 2.6 support was removed in 1.0.1, and Python 3.3/3.4 support was removed in 1.1.","severity":"breaking","affected_versions":">=1.0.1, >=1.1"},{"fix":"Migrate to `Bleach` for sanitization. Note that `Bleach` is not a drop-in replacement and may require tuning due to different default allow lists and escaping behaviors.","message":"The `html5lib` sanitizer (via `html5lib.serialize(sanitize=True)` and `html5lib.filters.sanitizer`) has been deprecated since version 1.1. 
Users are recommended to migrate to the `Bleach` library for HTML sanitization.","severity":"deprecated","affected_versions":">=1.1"},{"fix":"Replace direct imports or references to `html5lib.treebuilders.dom` with `html5lib.getTreeBuilder(\"dom\")`.","message":"The default DOM treebuilder was removed, meaning `html5lib.treebuilders.dom` is no longer directly supported as a module. Instead, `html5lib.treebuilders.getTreeBuilder(\"dom\")` should be used, which returns a builder using `xml.dom.minidom`.","severity":"breaking","affected_versions":">=1.0b1"},{"fix":"Ensure `html5lib` is installed in the correct environment (e.g., using `pip install html5lib` or `conda install html5lib`). If the issue persists, try installing `BeautifulSoup4` and `lxml` alongside `html5lib` (`pip install \"pandas[html]\"`) and restarting your kernel/environment.","message":"When using `html5lib` as a backend for `pandas.read_html()`, you might encounter `ImportError: missing optional dependency html5lib` even if `html5lib` is installed. This often happens if the `html5lib` installation is not correctly recognized by the `pandas` environment or if there are conflicts with other parsers.","severity":"gotcha","affected_versions":"All versions when used with pandas"},{"fix":"For performance-sensitive applications, consider using `lxml` as the parser directly or as a treebuilder with `html5lib` if some HTML5-specific error handling is still desired. `lxml` can be used as a tree format with `html5lib` by specifying `treebuilder='lxml'`.","message":"`html5lib` is a pure-Python library and can be significantly slower than alternatives like `lxml` (which is written in C). 
While `html5lib` provides more specification-compliant parsing, performance-critical applications might prefer `lxml` where strict HTML5 parsing isn't the absolute highest priority.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure that the input HTML document is well-formed and begins with a `<!DOCTYPE html>` declaration if a strict parsing environment is used. If parsing an HTML fragment, consider wrapping it with full `<html><body>...</body></html>` tags, potentially including a DOCTYPE. Review the `html5lib` initialization or the configuration of any library using `html5lib` to check for options that might enable a strict parsing or validation mode, and disable them if lenient parsing is desired.","message":"The `html5lib` library, or a component utilizing it, produced a 'Strict parsing error: Unexpected start tag (div). Expected DOCTYPE.' This error suggests that the parser encountered an HTML tag (like `div`) at the beginning of the document when it was strictly expecting a `<!DOCTYPE html>` declaration. While `html5lib` is generally lenient with malformed HTML and missing DOCTYPEs, this error indicates that a strict parsing mode might be enabled, or the input HTML is severely malformed for the parser's current configuration.","severity":"breaking","affected_versions":"All versions"},{"fix":"Ensure the input HTML is a complete, well-formed HTML5 document, including a correct `<!DOCTYPE html>` declaration at the very beginning. If parsing HTML fragments, ensure the appropriate API (e.g., `parseFragment` if available and applicable) is used instead of one expecting a full document. Verify that no strict parsing options are explicitly or inadvertently enabled (e.g., passing `strict=True` to `HTMLParser` or its methods).","message":"An unexpected 'Strict parsing error: Unexpected start tag (div). Expected DOCTYPE.' occurred during HTML parsing. 
This error suggests that the parser encountered a 'div' tag when it was expecting a 'DOCTYPE' declaration, possibly due to malformed HTML input, reuse of a parser that expects a new document, or strict parsing being inadvertently enabled. While html5lib is generally lenient, such a specific 'strict parsing error' indicates a fundamental issue with the input structure or parser configuration.","severity":"breaking","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-05-12T18:50:11.218Z","next_check":"2026-06-27T00:00:00.000Z","problems":[{"fix":"Install the library using pip: `pip install html5lib`","cause":"The 'html5lib' library is not installed in the Python environment being used, or the Python interpreter cannot find it.","error":"ModuleNotFoundError: No module named 'html5lib'"},{"fix":"Install the library using pip: `pip install html5lib`. If it's already installed, restart your IDE or Jupyter kernel and confirm that you are in the correct virtual environment.","cause":"This error occurs when a dependent library (like pandas or BeautifulSoup) attempts to use html5lib, but it is not installed or accessible in the current Python environment.","error":"ImportError: \"html5lib not found, please install it\""},{"fix":"Upgrade BeautifulSoup4 to a compatible version (e.g., `pip install --upgrade beautifulsoup4`). 
If the issue persists, ensure `html5lib` is also updated: `pip install --upgrade html5lib`.","cause":"This error typically arises from an incompatibility between `html5lib` and `BeautifulSoup4` versions, where `BeautifulSoup4` expects an older internal structure of `html5lib`'s tree builders.","error":"AttributeError: module 'html5lib.treebuilders' has no attribute '_base'"},{"fix":"Ensure you are installing `html5lib` as a standalone package (`pip install html5lib`) and importing it directly as `import html5lib`, not from `pip._vendor`.","cause":"This error occurs when attempting to import `html5lib` from `pip`'s internal, vendorized modules, which is not intended for direct use by user code.","error":"ImportError: cannot import name 'html5lib' from 'pip._vendor'"},{"fix":"Inspect the HTML input for syntax errors, especially around attribute declarations. `ParseError` is raised only in strict mode, so to parse irregular HTML leniently, construct the parser with the default `strict=False` (e.g., `parser = html5lib.HTMLParser(strict=False)`); the parser then recovers from such errors instead of raising.","cause":"This specific ParseError indicates that `html5lib`, running in strict mode, encountered a character in the HTML document that it did not expect immediately after an attribute value, suggesting malformed HTML.","error":"html5lib.html5parser.ParseError: Unexpected character after attribute value"}],"ecosystem":"pypi","meta_description":null,"install_score":100,"install_tag":"verified","quickstart_score":80,"quickstart_tag":"verified","pypi_latest":"1.1","cli_name":"","cli_version":null,"install_checks":{"last_tested":"2026-05-12","tag":"verified","tag_description":"installs cleanly on critical runtimes, fast import, recently tested","installed_version":null,"pypi_latest":"1.1","is_stale":null,"results":[{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine 
(musl)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.2,"mem_mb":7.4,"disk_size":"19.1M"},{"runtime":"python:3.10-alpine","python_version":"3.10","os_libc":"alpine (musl)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.21,"mem_mb":7.4,"disk_size":"19.1M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":1.6,"import_time_s":0.16,"mem_mb":7.4,"disk_size":"20M"},{"runtime":"python:3.10-slim","python_version":"3.10","os_libc":"slim (glibc)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.16,"mem_mb":7.4,"disk_size":"20M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.28,"mem_mb":7.9,"disk_size":"21.1M"},{"runtime":"python:3.11-alpine","python_version":"3.11","os_libc":"alpine (musl)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.3,"mem_mb":7.9,"disk_size":"21.1M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim (glibc)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":1.8,"import_time_s":0.24,"mem_mb":7.9,"disk_size":"22M"},{"runtime":"python:3.11-slim","python_version":"3.11","os_libc":"slim 
(glibc)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.23,"mem_mb":7.9,"disk_size":"22M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.24,"mem_mb":7.5,"disk_size":"12.9M"},{"runtime":"python:3.12-alpine","python_version":"3.12","os_libc":"alpine (musl)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.45,"mem_mb":7.5,"disk_size":"12.9M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":1.6,"import_time_s":0.23,"mem_mb":7.5,"disk_size":"13M"},{"runtime":"python:3.12-slim","python_version":"3.12","os_libc":"slim (glibc)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.24,"mem_mb":7.5,"disk_size":"13M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.22,"mem_mb":7.2,"disk_size":"12.7M"},{"runtime":"python:3.13-alpine","python_version":"3.13","os_libc":"alpine (musl)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.23,"mem_mb":7.2,"disk_size":"12.6M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim 
(glibc)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":1.6,"import_time_s":0.22,"mem_mb":7,"disk_size":"13M"},{"runtime":"python:3.13-slim","python_version":"3.13","os_libc":"slim (glibc)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.24,"mem_mb":7,"disk_size":"13M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.18,"mem_mb":7.7,"disk_size":"18.5M"},{"runtime":"python:3.9-alpine","python_version":"3.9","os_libc":"alpine (musl)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.2,"mem_mb":7.7,"disk_size":"18.5M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"html5lib","exit_code":0,"wheel_type":"wheel","failure_reason":null,"import_side_effects":null,"install_time_s":1.9,"import_time_s":0.16,"mem_mb":7.7,"disk_size":"19M"},{"runtime":"python:3.9-slim","python_version":"3.9","os_libc":"slim (glibc)","variant":"html5lib","exit_code":0,"wheel_type":null,"failure_reason":null,"import_side_effects":null,"install_time_s":null,"import_time_s":0.17,"mem_mb":7.7,"disk_size":"19M"}]},"quickstart_checks":{"last_tested":"2026-04-24","tag":"verified","tag_description":"quickstart runs on critical runtimes, recently 
tested","results":[{"runtime":"python:3.10-alpine","exit_code":0},{"runtime":"python:3.10-slim","exit_code":0},{"runtime":"python:3.11-alpine","exit_code":0},{"runtime":"python:3.11-slim","exit_code":0},{"runtime":"python:3.12-alpine","exit_code":0},{"runtime":"python:3.12-slim","exit_code":0},{"runtime":"python:3.13-alpine","exit_code":0},{"runtime":"python:3.13-slim","exit_code":0},{"runtime":"python:3.9-alpine","exit_code":0},{"runtime":"python:3.9-slim","exit_code":0}]}}