HTML5 Parser for Python
html5lib is a pure-Python library for parsing HTML documents, designed to conform to the WHATWG HTML specification as implemented by major web browsers. The current stable version is 1.1, released in June 2020; version 1.2 is in development but unreleased. Releases are irregular, with long gaps between stable versions.
Warnings
- breaking Support for Python 2.6, 3.3, and 3.4 has been dropped. Python 2.6 support was removed in 1.0.1, and Python 3.3 and 3.4 support was removed in 1.1.
- deprecated The `html5lib` sanitizer (via `html5lib.serialize(sanitize=True)` and `html5lib.filters.sanitizer`) has been deprecated since version 1.1. Users are recommended to migrate to the `Bleach` library for HTML sanitization.
- breaking The default DOM treebuilder was removed, meaning `html5lib.treebuilders.dom` is no longer directly supported as a module. Instead, `html5lib.treebuilders.getTreeBuilder("dom")` should be used, which returns a builder using `xml.dom.minidom`.
- gotcha When using `html5lib` as a backend for `pandas.read_html()`, you might encounter `ImportError: missing optional dependency html5lib` even if `html5lib` is installed. This often happens if the `html5lib` installation is not correctly recognized by the `pandas` environment or if there are conflicts with other parsers.
- gotcha `html5lib` is a pure-Python library and can be significantly slower than alternatives like `lxml` (which is written in C). While `html5lib` provides more specification-compliant parsing, performance-critical applications might prefer `lxml` where strict HTML5 parsing isn't the absolute highest priority.
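For the pandas gotcha above, one way to narrow things down is to check which modules the interpreter that runs pandas can actually see. This is a minimal sketch using only the standard library. `describe_module` is a hypothetical helper name, and checking `bs4` rests on the assumption that the html5lib flavor of `pandas.read_html()` also needs BeautifulSoup.

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether `name` is importable by the current interpreter.

    Hypothetical helper for illustration; not part of pandas or html5lib.
    """
    spec = importlib.util.find_spec(name)
    if spec is None:
        return f"{name}: not importable from {sys.executable}"
    return f"{name}: found at {spec.origin}"


# Assumption: the html5lib flavor of pandas.read_html() also relies on bs4.
for module_name in ("pandas", "html5lib", "bs4"):
    print(describe_module(module_name))
```

If a module is reported as not importable, it was probably installed into a different environment than the one named in the message.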
Install
pip install html5lib
Imports
- html5lib
import html5lib
- HTMLParser
from html5lib.html5parser import HTMLParser
- getTreeBuilder
from html5lib import getTreeBuilder
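As a rough illustration of how the three imports listed above fit together, the sketch below builds a parser on the ElementTree treebuilder. The `namespaceHTMLElements=False` argument and the sample markup are choices made for this example, not something the list above prescribes.

```python
import html5lib
from html5lib import getTreeBuilder
from html5lib.html5parser import HTMLParser

# The top-level package re-exports the parser class from html5lib.html5parser.
same_class = html5lib.HTMLParser is HTMLParser

# Build a parser on the ElementTree treebuilder. Turning namespacing off
# yields plain tag names such as "html" instead of "{namespace}html".
etree_builder = getTreeBuilder("etree")
parser = HTMLParser(tree=etree_builder, namespaceHTMLElements=False)
root = parser.parse("<title>Demo</title><p>Body text</p>")

# The parser supplies the implied <html>, <head>, and <body> elements.
title_text = root.findtext("head/title")
paragraph_text = root.findtext("body/p")
print(root.tag, title_text, paragraph_text, sep=" | ")
```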
Quickstart
import html5lib
# Parse a simple HTML string
document = html5lib.parse("<p>Hello <b>World</b>!</p>")
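# The default "etree" treebuilder returns the root <html> element; tags carry the XHTML namespace
print(f"Parsed document tag: {document.tag}")
# The parser adds the implied <head> and <body>, so the first child is <head> and its text is None
print(f"First child's tag: {document[0].tag}")
print(f"First child's text: {document[0].text}")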
# Parse with a specific treebuilder (e.g., xml.dom.minidom)
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Another <i>example</i>.</p>")
# Accessing an element using minidom structure
p_tag = minidom_document.getElementsByTagName('p')[0]
print(f"Minidom document P tag: {p_tag.tagName}")
# Example with strict parsing (raises ParseError on the first parse error)
try:
    strict_parser = html5lib.HTMLParser(strict=True)
    # The input has no doctype, which already counts as a parse error in strict mode
    strict_parser.parse("<div><p>Missing close tag")
except html5lib.html5parser.ParseError as e:
    print(f"Strict parsing error: {e}")
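Two follow-ups to the quickstart may be useful; both are sketches, not the only way to do this. The first reaches the paragraph inside the default namespaced ElementTree output. The second reads the `errors` list that a non-strict parser fills in instead of raising. The `XHTML_NS` constant and the variable names are introduced here for illustration.

```python
import html5lib

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Part 1: locate the paragraph in the default (namespaced) ElementTree output.
document = html5lib.parse("<p>Hello <b>World</b>!</p>")
paragraph = document.find(f".//{{{XHTML_NS}}}p")
bold = paragraph[0]
# ElementTree splits mixed content into .text (before a child) and .tail (after it).
paragraph_text = "".join(paragraph.itertext())
print(paragraph.text, bold.text, bold.tail, sep="|")
print(paragraph_text)

# Part 2: a non-strict parser records parse errors instead of raising.
lenient_parser = html5lib.HTMLParser()
lenient_parser.parse("<div><p>Missing close tag")
# Each entry is a (position, error code, details) tuple.
error_codes = [code for _position, code, _details in lenient_parser.errors]
print(error_codes)
```

Collecting errors this way keeps the browser-style recovery that html5lib is designed for while still showing what was wrong with the input.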