HTML5 Parser for Python

1.1 verified Tue May 12 auth: no python install: verified quickstart: verified

html5lib is a pure-Python library for parsing HTML documents, designed to conform to the WHATWG HTML specification, as implemented by major web browsers. Its current stable version is 1.1, released in June 2020, with development on version 1.2 ongoing but unreleased. The library's release cadence is irregular, with significant time between major stable releases.

pip install html5lib
error ModuleNotFoundError: No module named 'html5lib'
cause The 'html5lib' library is not installed in the Python environment being used, or the Python interpreter cannot find it.
fix
Install the library using pip: pip install html5lib
error ImportError: "html5lib not found, please install it"
cause This error occurs when a dependent library (like Pandas or BeautifulSoup) attempts to use html5lib, but it is not installed or accessible in the current Python environment.
fix
Install the library using pip: pip install html5lib. If it's already installed, ensure your IDE or Jupyter kernel is restarted, or that you are in the correct virtual environment.
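To confirm which environment is actually in use, a small standard-library check can help. This is only a sketch: `describe_module` is a hypothetical helper name, not part of html5lib, pandas, or BeautifulSoup.

```python
import importlib.util
import sys


def describe_module(name):
    """Report where the current interpreter would load `name` from, if anywhere."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


info = describe_module("html5lib")
print(f"Interpreter: {info['interpreter']}")
print(f"html5lib importable: {info['found']} ({info['location']})")
```

Run it from the same kernel or virtual environment that raises the error. If the interpreter path is not the one pip installed into, that mismatch is the likely cause.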
error AttributeError: module 'html5lib.treebuilders' has no attribute '_base'
cause This error typically arises from an incompatibility between `html5lib` and `BeautifulSoup4` versions, where `BeautifulSoup4` expects an older internal structure of `html5lib`'s tree builders.
fix
Upgrade BeautifulSoup4 to a compatible version (e.g., pip install --upgrade beautifulsoup4). If the issue persists, ensure html5lib is also updated: pip install --upgrade html5lib.
error ImportError: cannot import name 'html5lib' from 'pip._vendor'
cause This error occurs when attempting to import `html5lib` from `pip`'s internal, vendorized modules, which is not intended for direct use by user code.
fix
Ensure you are installing html5lib as a standalone package (pip install html5lib) and importing it directly as import html5lib, not from pip._vendor.
error html5lib.html5parser.ParseError: Unexpected character after attribute value
cause This specific ParseError indicates that `html5lib` encountered a character in the HTML document that it did not expect immediately after an attribute value, suggesting malformed HTML.
fix
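Inspect the HTML input for syntax errors, especially around attribute declarations (for example, a missing space between a closing quote and the next attribute name). html5lib raises ParseError only when the parser was created with strict=True. The default, html5lib.HTMLParser() or equivalently html5lib.HTMLParser(strict=False), recovers from malformed markup the way browsers do. If the input cannot be corrected, stop passing strict=True to the parser, or to whichever library configures it.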
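The difference between the two modes can be sketched as follows. The input string is a made-up example, and the `(position, code, details)` shape of the entries in `parser.errors` is an assumption based on html5lib 1.1's behavior rather than a documented contract.

```python
import html5lib

# Hypothetical input: a stray character directly after a quoted attribute value.
MALFORMED = '<!DOCTYPE html><p class="a"b>text</p>'

# Default (lenient) mode: the parser recovers and records each problem.
lenient = html5lib.HTMLParser()
lenient.parse(MALFORMED)
error_codes = [code for _position, code, _details in lenient.errors]
print(error_codes)

# Strict mode: the first recorded problem is raised as ParseError instead.
try:
    html5lib.HTMLParser(strict=True).parse(MALFORMED)
    strict_raised = False
except html5lib.html5parser.ParseError as exc:
    strict_raised = True
    print(f"ParseError: {exc}")
```

Collecting `errors` in lenient mode is one way to log malformed markup without aborting the parse.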
breaking Support for Python 2.6, 3.3, and 3.4 has been dropped in recent versions. Specifically, Python 2.6 support was removed in 1.0.1, and Python 3.3/3.4 support was removed in 1.1.
fix Use Python 3.5 or later (or Python 2.7) with html5lib 1.1.
deprecated The `html5lib` sanitizer (via `html5lib.serialize(sanitize=True)` and `html5lib.filters.sanitizer`) has been deprecated since version 1.1. Users are recommended to migrate to the `Bleach` library for HTML sanitization.
fix Migrate to `Bleach` for sanitization. Note that `Bleach` is not a drop-in replacement and may require tuning due to different default allow lists and escaping behaviors.
breaking The default DOM treebuilder was removed, meaning `html5lib.treebuilders.dom` is no longer directly supported as a module. Instead, `html5lib.treebuilders.getTreeBuilder("dom")` should be used, which returns a builder using `xml.dom.minidom`.
fix Replace direct imports or references to `html5lib.treebuilders.dom` with `html5lib.treebuilders.getTreeBuilder("dom")`, which is also exposed at the top level as `html5lib.getTreeBuilder("dom")`.
gotcha When using `html5lib` as a backend for `pandas.read_html()`, you might encounter `ImportError: missing optional dependency html5lib` even if `html5lib` is installed. This often happens if the `html5lib` installation is not correctly recognized by the `pandas` environment or if there are conflicts with other parsers.
fix Ensure `html5lib` is installed in the correct environment (e.g., using `pip install html5lib` or `conda install html5lib`). If the issue persists, try installing `BeautifulSoup4` and `lxml` alongside `html5lib` (`pip install "pandas[html]"`) and restarting your kernel/environment.
gotcha `html5lib` is a pure-Python library and can be significantly slower than alternatives like `lxml` (which is written in C). While `html5lib` provides more specification-compliant parsing, performance-critical applications might prefer `lxml` where strict HTML5 parsing isn't the absolute highest priority.
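fix For performance-sensitive applications, consider parsing with `lxml` directly. If HTML5-compliant error recovery is still required but downstream code expects `lxml` trees, pass `treebuilder='lxml'` to `html5lib.parse`. Note that this changes only the output tree type: tokenizing and tree construction still run in html5lib's pure-Python code, so it should not be expected to match the speed of `lxml`'s own parser.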
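A minimal sketch of selecting the `lxml` treebuilder when it is available follows. `parse_with_available_treebuilder` is a hypothetical helper, and the fallback relies on the assumption that html5lib raises `ImportError` when `lxml` is not installed.

```python
import html5lib


def parse_with_available_treebuilder(html):
    """Parse with the lxml treebuilder when lxml is installed, else fall back to etree."""
    try:
        return "lxml", html5lib.parse(html, treebuilder="lxml")
    except ImportError:
        # lxml is optional; the default xml.etree treebuilder needs no extra packages.
        return "etree", html5lib.parse(html, treebuilder="etree")


builder_name, tree = parse_with_available_treebuilder("<p>Hello</p>")
print(f"Tree built with: {builder_name}")
```

The returned object differs by treebuilder (an `lxml` tree versus an `xml.etree` element), so callers should branch on `builder_name` before using tree-specific methods.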
breaking With strict parsing enabled, `html5lib` raises `html5lib.html5parser.ParseError: Unexpected start tag (div). Expected DOCTYPE.` when a document starts with a tag such as `div` instead of a `<!DOCTYPE html>` declaration. The `Strict parsing error:` prefix is not part of the exception; it comes from the calling code's own message, as in the quickstart's `print`. In the default lenient mode the parser recovers from a missing DOCTYPE without raising, so this error means `strict=True` was passed to `HTMLParser`, either directly or by a library that wraps it.
fix In strict mode, begin the document with `<!DOCTYPE html>`. Wrapping a fragment in `<html><body>...</body></html>` without a DOCTYPE still triggers the error. If lenient parsing is acceptable, create the parser without `strict=True`, and review the configuration of any library that uses `html5lib` for an option that enables strict parsing or validation.
breaking The same 'Strict parsing error: Unexpected start tag (div). Expected DOCTYPE.' also appears when an HTML fragment is passed to an API that expects a complete document while strict parsing is enabled. The parser encountered a 'div' tag where it expected a 'DOCTYPE' declaration. html5lib reports this only in strict mode, so the message points at the input structure or the parser configuration.
fix For complete documents, begin the input with a correct `<!DOCTYPE html>` declaration. For fragments, use `html5lib.parseFragment` (or `HTMLParser.parseFragment`) instead of an API that expects a full document. Verify that `strict=True` is not being passed to `HTMLParser`, explicitly or through a wrapper.
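The three cases above (no DOCTYPE, full document, fragment) can be compared with a short sketch. `strict_parse_error` is a hypothetical helper written for this illustration, and the sample markup is invented.

```python
import html5lib
from html5lib.html5parser import ParseError


def strict_parse_error(html):
    """Return the strict-mode error message for `html`, or None if it parses cleanly."""
    try:
        html5lib.HTMLParser(strict=True).parse(html)
    except ParseError as exc:
        return str(exc)
    return None


# No DOCTYPE: strict mode reports the problem.
print(strict_parse_error("<div>hi</div>"))

# Complete document with DOCTYPE: no error is expected.
FULL_DOC = (
    "<!DOCTYPE html><html><head><title>t</title></head>"
    "<body><div>hi</div></body></html>"
)
print(strict_parse_error(FULL_DOC))

# Fragments are parsed without the document-level DOCTYPE check.
fragment = html5lib.HTMLParser(strict=True).parseFragment("<div><p>Hello</p></div>")
print(len(fragment))
```

Using `parseFragment` avoids having to invent a wrapper document for snippets such as template partials or user-submitted markup.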
python  os / libc      status  wheel  install  import  disk
3.10    alpine (musl)          wheel  -        0.20s   19.1M
3.10    alpine (musl)          -      -        0.21s   19.1M
3.10    slim (glibc)           wheel  1.6s     0.16s   20M
3.10    slim (glibc)           -      -        0.16s   20M
3.11    alpine (musl)          wheel  -        0.28s   21.1M
3.11    alpine (musl)          -      -        0.30s   21.1M
3.11    slim (glibc)           wheel  1.8s     0.24s   22M
3.11    slim (glibc)           -      -        0.23s   22M
3.12    alpine (musl)          wheel  -        0.24s   12.9M
3.12    alpine (musl)          -      -        0.45s   12.9M
3.12    slim (glibc)           wheel  1.6s     0.23s   13M
3.12    slim (glibc)           -      -        0.24s   13M
3.13    alpine (musl)          wheel  -        0.22s   12.7M
3.13    alpine (musl)          -      -        0.23s   12.6M
3.13    slim (glibc)           wheel  1.6s     0.22s   13M
3.13    slim (glibc)           -      -        0.24s   13M
3.9     alpine (musl)          wheel  -        0.18s   18.5M
3.9     alpine (musl)          -      -        0.20s   18.5M
3.9     slim (glibc)           wheel  1.9s     0.16s   19M
3.9     slim (glibc)           -      -        0.17s   19M

This quickstart demonstrates basic HTML parsing using `html5lib.parse` for the default `xml.etree` output, and how to use `html5lib.HTMLParser` with a custom treebuilder like `xml.dom.minidom`. It also shows how to enable strict parsing to catch HTML errors. By default, `html5lib` provides an `xml.etree` element instance, but `xml.dom.minidom` and `lxml.etree` are also supported via treebuilders.

import html5lib

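# Parse a simple HTML string. The default treebuilder returns the root <html>
# element as an xml.etree Element, with tags in the XHTML namespace.
document = html5lib.parse("<p>Hello <b>World</b>!</p>")
print(f"Parsed document tag: {document.tag}")
# html5lib inserts the implied <head> and <body>, so the first child is <head>
print(f"First child's tag: {document[0].tag}")
# The implied <head> is empty, so its text is None
print(f"First child's text: {document[0].text}")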

# Parse with a specific treebuilder (e.g., xml.dom.minidom)
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Another <i>example</i>.</p>")
# Accessing an element using minidom structure
p_tag = minidom_document.getElementsByTagName('p')[0]
print(f"Minidom document P tag: {p_tag.tagName}")

# Example with strict parsing: ParseError is raised on the first parse error,
# which for this input is the missing DOCTYPE rather than the unclosed tags
try:
    strict_parser = html5lib.HTMLParser(strict=True)
    strict_parser.parse("<div><p>Missing close tag")
except html5lib.html5parser.ParseError as e:
    print(f"Strict parsing error: {e}")
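Because the default `xml.etree` output places every tag in the XHTML namespace, element lookups need either the namespace or the `namespaceHTMLElements=False` option. The sketch below shows both. The constant names are illustrative, and the option is passed through `html5lib.parse` as a keyword argument.

```python
import html5lib

HTML = "<p>Hello <b>World</b>!</p>"
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Default: tags carry the XHTML namespace, so lookups must include it.
namespaced = html5lib.parse(HTML)
p_namespaced = namespaced.find(f".//{{{XHTML_NS}}}p")

# With namespaceHTMLElements=False, plain tag names work.
plain = html5lib.parse(HTML, namespaceHTMLElements=False)
p_plain = plain.find(".//p")

# itertext() walks the element's own text plus that of its descendants.
print("".join(p_namespaced.itertext()))
print("".join(p_plain.itertext()))
```

Dropping the namespace keeps XPath-style queries short, at the cost of no longer distinguishing HTML elements from embedded SVG or MathML.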